Abstract

Despite the conceptual simplicity of sequential consistency (SC),
the semantics of SC atomic operations and fences in the C11 and
OpenCL memory models is subtle, leading to convoluted prose
descriptions that translate to complex axiomatic formalisations. We
conduct an overhaul of SC atomics in C11, reducing the associated
axioms in both number and complexity. A consequence of our
simplification is that the SC operations in an execution no longer
need to be totally ordered. This relaxation enables, for the first
time, efficient and exhaustive simulation of litmus tests that use
SC atomics. We extend our improved C11 model to obtain the first
rigorous memory model formalisation for OpenCL (which extends
C11 with support for heterogeneous many-core programming). In
the OpenCL setting, we refine the SC axioms still further to give
a sensible semantics to SC operations that employ a ‘memory
scope’ to restrict their visibility to specific threads. Our overhaul
requires slight strengthenings of both the C11 and the OpenCL
memory models, causing some behaviours to become disallowed.
We argue that these strengthenings are natural, and that all of the
formalised C11 and OpenCL compilation schemes of which we are
aware (Power and x86 CPUs for C11, AMD GPUs for OpenCL)
remain valid in our revised models. Using the Herd memory
model simulator, we show that our overhaul leads to an exponential
improvement in simulation time for C11 litmus tests compared
with the original model, making exhaustive simulation competitive,
time-wise, with the non-exhaustive CDSChecker tool.

Categories and Subject Descriptors D.3.1 [Programming Languages]:
Formal Definitions and Theory; D.3.3 [Programming
Languages]: Language Constructs and Features; F.3.2 [Logics and
Meanings of Programs]: Semantics of Programming Languages

Keywords Formal methods, graphics processing unit (GPU),
heterogeneous programming, HOL theorem prover, language design,
program simulation, weak memory models



1. Introduction 

Atomics and memory models. C11 and OpenCL both define a
collection of atomic operations, or ‘atomics’, which can be used
by experts to program high-performance, lock-free algorithms in a
portable manner. Atomics accept a memory order parameter, which
controls the exposure of certain relaxed memory behaviours that
modern CPUs and GPUs natively exhibit.

The C11 and OpenCL specifications define the semantics
of atomics via axiomatic memory models, that is, sets of rules
that govern the reading and writing of shared memory locations.
These memory models are complex, stretching to about 19 and
30 pages, respectively, of convoluted prose. This complexity makes
it extremely challenging to reason about the correctness of programs
that are written in, and compilers that implement, these languages.

Correctness in any relaxed memory setting is notoriously evasive;
indeed, the subtleties of relaxed memory have previously led
to confirmed bugs in language specifications, deployed
processors, compilers [27, 40] and vendor-endorsed programming
guides. The importance of correctness in the context of C11 is
well-known. Correctness is just as crucial in OpenCL, which is an
open standard for heterogeneous programming that is developed
and supported by major hardware vendors such as Altera, AMD,
ARM, Intel, Nvidia, Qualcomm and Xilinx. OpenCL is a key player
in the recent drive to exploit GPUs and FPGAs in general-purpose
computing, including in safety-critical domains such as medical
imaging and autonomous navigation.

We seek in our work to tame the complexity of these memory 
models through formalisation. 

The C11 memory model has been formalised by several researchers,
in varying degrees of completeness, and with varying degrees of
fidelity to the standard. These formalisation efforts have
proved fruitful; they have, for instance, enabled the construction
of simulators that automatically explore the allowed behaviours of
small C11 programs (called litmus tests), underpinned
the design of program logics for specifying and verifying C11
programs [37, 38], and provided a firm foundation for ongoing
debate about the design of the C11 memory model itself [10, 39].

The OpenCL memory model (introduced in version 2.0 of the
standard) has received comparatively little academic attention, with
the notable exception of the work of Gaster et al., which we
discuss further later in the paper. OpenCL provides a framework for
CPU programs to delegate the execution of massively-parallel kernel
functions, written in a variant of C, to one or more accelerator
devices, such as GPUs or FPGAs. Threads that execute these kernels
are organised into a hierarchy: threads¹ are grouped into
work-groups, and work-groups are grouped by device. The OpenCL
memory model is broadly similar to that of C11, but is extended
with features such as memory regions (which contain locations that
are accessible only to a certain subtree of the thread hierarchy),
and memory scopes (which, when applied to an atomic operation,
confine its visibility to a certain subtree of threads).

¹ Threads in OpenCL are also called work-items.

SC atomics. Our work is distinguished by its focus on the
sequentially consistent (SC) fragment of these memory models;
that is, the semantics of atomics whose memory order is
memory_order_seq_cst. The chief guarantee provided by this
memory order is that all SC atomics in a given execution will
execute in some order (say, S) on which all threads mutually agree.
Note that these memory models do not construct S; they merely
postulate the existence of a suitable S.

Sequential consistency is known for its simplicity [26], and
indeed, any C11 or OpenCL program using exclusively SC atomics
would enjoy a simple interleaving semantics. However, when
combined with the more relaxed memory orders that C11 and OpenCL
also provide, the semantics of SC atomics becomes highly complex,
and it is this complexity that we tackle in this paper.
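
To make the phrase ‘simple interleaving semantics’ concrete, the
following Python sketch enumerates every interleaving of the classic
store-buffering litmus test, under the assumption that all four
accesses are SC. The test itself and the names THREADS, interleavings
and run are illustrative choices that do not come from the
specifications; the sketch models a single shared memory and nothing
more.

```python
# Store buffering, assuming every access is SC:
#   thread 0: store(x,1); r0 = load(y)
#   thread 1: store(y,1); r1 = load(x)
THREADS = [
    [("store", "x", 1), ("load", "y", "r0")],
    [("store", "y", 1), ("load", "x", "r1")],
]


def interleavings(threads):
    """Yield every merge of the threads that preserves each thread's order."""
    if all(len(t) == 0 for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def run(trace):
    """Execute one interleaving against a single memory, initially zero."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for kind, loc, arg in trace:
        if kind == "store":
            memory[loc] = arg
        else:
            registers[arg] = memory[loc]
    return (registers["r0"], registers["r1"])


outcomes = {run(trace) for trace in interleavings(THREADS)}
print(sorted(outcomes))
```

Under this interleaving reading, no trace lets both loads return zero.
Once weaker memory orders are mixed in, such a single-memory model no
longer suffices, which is where the complexity discussed above arises.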

SC atomics are in widespread use, partly because the SC memory
order is used when no other is specified, and partly because
programmers are routinely advised to use SC atomics prior to
optimising their code with the more relaxed memory orders [42 (p. 221)].
Algorithms that make use of SC atomics include Dekker’s mutual
exclusion algorithm, and more generally, multiple-producer-
multiple-consumer algorithms that require every consumer to
observe the actions of every producer in the same order.² As such, it is
important that the semantics of SC atomics is clear to programmers,
to allow smooth transitioning between the exclusive use of SC (for
ease of reasoning) to a mixture of SC and weaker-than-SC atomics
(for performance optimisation).

In theory, SC atomics can be avoided by replacing them with
mutex-protected non-atomic operations (and simple spinlock
mutexes can be implemented using just release and acquire
atomics [42 (p. 111)]). In practice, support for SC atomics is
non-negotiable if software is to make use of concurrency libraries. This
is because the aforementioned replacement of SC atomics must be
performed throughout the entire program - in both library code
and client code alike - and with the same mutex variable for every
operation. Moreover, programs that make extensive use of spinlocks
could prove less efficient than those that rely on native SC atomics,
and accidental misuse of locks may lead to deadlock.

1.1 Main Contributions 

Our work aims to provide clearer, simpler foundations for reasoning
about C11, enabling a clean extension to OpenCL for heterogeneous
programming, and facilitating efficient simulation.

1. Overhauling SC atomics in C11 (§3). The C11 specification
devotes around 276 words to explaining the semantics of SC atomics.
In our work, we have translated these words into mathematical
axioms, carefully strengthened these axioms (without imposing
unreasonable demands on the compiler), and then refactored them
so that they are expressed as simply as possible. Our revised text

✓ is shorter (requiring just 80 words in the same prose style),

✓ is simpler (because it reduces seven axioms to just one), and

✓ is amenable to more efficient simulation (see below).

² http://en.cppreference.com/w/cpp/atomic/memory_order

Supporting the revised text is a provably-equivalent model that
avoids the need to postulate the total order S. Instead, the model
constructs a partial order on SC operations, preserving only the
edges of S that can affect program behaviours. The enumeration
of all candidate S relations is one of the most expensive tasks for
memory model simulators like Herd; by reducing S to a partial
order, we can dramatically improve simulation performance.
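
The cost of enumerating candidate S relations can be made concrete
with a short Python sketch. It assumes a naive simulator that tries
every strict total order on the SC events of an execution; the function
name candidate_total_orders is hypothetical, and a real tool such as
Herd may prune this search in ways not modelled here.

```python
from itertools import permutations
from math import factorial


def candidate_total_orders(sc_events):
    """Every strict total order on the SC events, each given as a sequence."""
    return list(permutations(sc_events))


# The number of candidates grows as n! in the number n of SC events.
for n in range(1, 7):
    events = [f"e{k}" for k in range(n)]
    assert len(candidate_total_orders(events)) == factorial(n)
    print(n, factorial(n))
```

A partial order that keeps only the behaviour-relevant edges avoids
distinguishing between total orders that differ solely in edges that
cannot affect the outcome.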

2. Overhauling SC atomics in OpenCL. Our simplifications
to the rules governing SC atomics in C11 can be carried over directly
to OpenCL, where the same three benefits listed above can be reaped.
In the OpenCL setting, however, there is an additional complexity
in the semantics of SC atomics. Specifically, the total order S in
which all of a program’s SC atomics execute is only guaranteed
to exist when one of two conditions holds: either all SC atomics
in the program’s execution use the widest-possible memory scope
and only access memory shared between devices, or all SC atomics
have their memory scope limited to the current device and never
access memory shared between devices. We find that this semantics
is unhelpful to programmers, because if any SC atomic violates
these conditions, then no SC atomic is guaranteed to have semantics
stronger than acquire/release; this may lead to additional behaviours
not anticipated by the programmer. The semantics is simultaneously
unhelpful to compiler-writers: a loophole that we discovered in the
second condition above means that even device-scoped SC atomics
must be implemented using expensive inter-device synchronisation.

We have amended the rules that govern SC atomics in OpenCL,
so that the SC guarantees do not vanish immediately in the presence
of a differently-scoped SC atomic somewhere in the program, but
instead degrade gracefully. Our revised rules

✓ are shorter and simpler (we can replace 391 words in the
specification with 89 words in the same prose style),

✓ enable new programming patterns in OpenCL (such as programs
that use SC atomics in a natural manner, yet a manner that
violates the overly restrictive conditions above),

✓ let device-scoped SC atomics be efficiently implemented, and

✓ improve the compositionality of OpenCL semantics, and hence
the ability to write concurrency libraries (because the behaviour
of SC atomics no longer depends on unstable, global conditions).

3. Proving the implementability of our revised models (§5.2).
Our improvements to the SC axioms in the C11 and OpenCL
memory models hinge on slight strengthenings of the models; that is,
tweaking some of the axioms so that fewer executions are allowed.
This increases the demands on compilers that implement these
memory models, so it is important to check that our changes do
not invalidate existing compilation schemes. To this end, we prove
that all of the formalised C11 compilation schemes of which we are
aware (namely, those for Power and x86 machines) remain
sound after our changes, and we argue informally that our OpenCL
changes preserve the soundness of the only formalised OpenCL
compilation scheme (namely, that for AMD GPUs).

1.2 Supporting Contributions 

In order to justify the claims we make in our main contributions, we 
have established several supporting artefacts, which we believe are 
also valuable in their own right. 

4. Formalising the OpenCL memory model. The OpenCL
specification contains numerous ambiguities, omissions and
inconsistencies, which makes it a shaky structure upon which to build an
argument about the correctness of an OpenCL program or compiler.
The lack of clarity may lead programmers and compiler-writers
to cautiously opt for low-efficiency implementations that are
easier to guarantee correct. Moreover, there are instances where the
OpenCL specification authors have made unnecessarily
conservative, programmer-unfriendly decisions in the design of rules for
memory consistency. We provide the first mechanised formalisation
of the OpenCL memory model. Our formalisation serves to
clarify the specification, and can henceforth be used to underpin
future program logics for verifying OpenCL kernels, and to inform
further refinements to the memory model.³ In particular, we use our
rigorous memory model to show that the design decisions of the
specification can be made less conservative, offering programmers
more flexibility, without placing any additional burden on efficient
implementation of the language.

5. Formalising the memory models in .cat. We have
encoded the C11 and OpenCL memory models in the .cat
framework. Previous formalisations of the C11 memory model exist,
in Isabelle, Lem and Coq [38]; here we contribute the first
version in .cat. We conduct our development work in the .cat
language because it is the native input format to the Herd memory
model simulator, which has a proven record of efficiently simulating
a range of CPU machine-level memory models.

6. Extending the Herd memory model simulator. During
memory modelling work, tool support for simulating alternative
memory models against litmus tests is invaluable. Herd is able
to simulate any memory model expressed as a .cat file, but in
its original incarnation, it supported only machine-level models of
CPUs and GPUs. To explore our proposed changes to the C11
and OpenCL memory models, we extended Herd with a module for
generating executions of C11 and OpenCL programs, and support
for language-level memory models that incorporate ‘undefined
behaviour’ (a notion that is absent from machine-level models). This
involved adding around 8000 lines to the original Herd codebase.⁴
All of the examples in this paper have been automatically checked
with Herd. Using Herd, we have evaluated the impact of our
changes to the SC axioms, and found an exponential improvement
in simulation performance.

Online material. Our companion webpage provides instructions
for downloading Herd and our .cat formalisations.

2. The C11 Memory Model in .cat

This section describes formally the current C11 memory model.

The semantics of multi-threaded C11 programs is formalised
in two stages; the first concerning the thread-local semantics, and
the second capturing the memory model. Roughly speaking, the
first stage takes as input a C11 program and calculates its set
of executions (§2.1, §2.2); the second stage then compares each
execution to the memory model to determine which executions are
actually allowed (§2.3).

There exist several prior formalisations of the C11 memory
model. The novelty of this section is the first comprehensive
formalisation of the model in the .cat framework, which
enables the use of the efficient Herd simulator. For reasons
of space, and because they are orthogonal to the thrust of our
contributions, we omit our treatment of the ‘consume’ memory order,
unsequenced races and C11 locks from the paper. However, our
.cat-based formalisation fully accounts for these features, and is
provided on our companion webpage.

2.1 C11 Programs

A C11 program manipulates a set of shared memory locations.

Definition 1 (Memory locations). Each memory location is
declared with either a non-atomic or an atomic type. That is,
$\mathit{type}(l) \in \{\text{atomic}, \text{non-atomic}\}$ for every
memory location $l$.

³ Indeed, we have already built upon our formalisation in another piece
of work that investigates a proposed extension to the OpenCL memory
model.
⁴ As estimated by git log.

Definition 2 (Structure of C11 programs). We consider C11
programs of the form $P = \parallel_{t \in T} p_t$, where $T$ is a set
of thread identifiers, $p_t$ is a piece of sequential code, and
$\parallel$ is parallel composition.
(This static form of parallelism is a simplification of the dynamic
thread creation that C11 actually provides.)


Atomic locations can be accessed via atomic operations; these
include reads, writes, and read-modify-writes (RMWs). C11 also
defines fence operations. Atomic operations and fences expose the
programmer to relaxed memory behaviours; which behaviours are
exposed is controlled by the operation’s memory order parameter.


Definition 3 (Memory orders). The available memory orders in
C11 are:

    o ::= RLX   (relaxed)
        | ACQ   (acquire, only for reads/RMWs)
        | REL   (release, only for writes/RMWs)
        | AR    (acquire+release, only for RMWs)
        | SC    (sequentially consistent, the default).


Example 1 (A C11 program). We give below a contrived C11
program that operates on two atomic locations, x and y, using atomic
store and load operations with a variety of memory orders.

    atomic_int *x; atomic_int *y;

    store(x,1,RLX); || r1=load(x,RLX); || store(x,2,SC); || store(y,1,SC);
                    || r2=load(x,RLX); || r3=load(y,SC); || r4=load(x,SC);


2.2 C11 Executions

The C11 memory model is defined in terms of program executions.
An execution X takes the form of a mathematical graph, where
each node $e \in E$ is labelled with a run-time memory event (see
Def. 4), and the edges connect events performed by the same thread
in program order. In other words, an execution is a partial order over
a set E of events, and can be thought of as a ‘concurrent trace’.

Definition 4 (Event labels). Each event’s label characterises the
kind of instruction that gave rise to the event, and incorporates up
to four attributes, as listed in the first five columns of the following
table:

    kind  (loc,  rval,  wval,  ord)  |  R   W   F   A
    Wna   (l,           v         )  |      ✓
    W     (l,           v,     o  )  |      ✓       ✓
    Rna   (l,    v                )  |  ✓
    R     (l,    v,            o  )  |  ✓           ✓
    RMW   (l,    v,     v',    o  )  |  ✓   ✓       ✓
    F     (                    o  )  |          ✓

The labels represent (reading down): non-atomic writes, atomic
writes, non-atomic reads, atomic reads, RMWs (which are always
atomic), and memory fences. Where relevant, labels contain (reading
across): the location being accessed, the value being read, the value
being written, and the memory order specified by the programmer.
A ✓-mark on the right-hand side of the table indicates that an event
with this label belongs to the set R (resp. W, F, A) of events that
read (resp. write, are a fence, are atomic). Let $\mathcal{L}$ denote
the set of labels.

Definition 5 (Executions). An execution is a tuple
$X = (E, I, \mathit{lbl}, \mathit{thd}, \mathit{sb})$ with the
following components.

• $E$ is a set of event identifiers.

• $\mathit{lbl} \in E \to \mathcal{L}$ associates each event with a
label from the set $\mathcal{L}$ of labels. For each event $e$,
$\mathit{loc}(e)$ projects the loc attribute of $\mathit{lbl}(e)$ (if
applicable); $\mathit{rval}(e)$, $\mathit{wval}(e)$ and
$\mathit{ord}(e)$ provide similar projections.

• $I \subseteq E$ is a set of initial events. Every initial event
$e \in I$ is a non-atomic write of zero; that is,
$\mathit{kind}(e) = \text{Wna}$ and $\mathit{wval}(e) = 0$.
Moreover, there is exactly one initial event per location.

• $\mathit{thd} \subseteq (E \setminus I)^2$ is an equivalence relation
on non-initial events that relates events from the same thread.

• $\mathit{sb} \subseteq \mathit{thd}$ is the sequenced-before
relation: a strict partial order (i.e., irreflexive and transitive)
between events from the same thread, that captures the program order.

Let $\mathbb{X}$ be the set of all executions. Next, we define a number of
derived sets and relations over the events of an execution that will
prove useful in describing the memory model.

Definition 6 (Derived sets and relations). In the context of
an execution $(E, I, \mathit{lbl}, \mathit{thd}, \mathit{sb})$, we
define the relation $=_{\mathit{loc}}$ as
$\{(e, e') \in (E \setminus F)^2 \mid \mathit{loc}(e) = \mathit{loc}(e')\}$;
it holds between non-fence events that access the same location. The
relation $=_{\mathit{val}}$, defined as
$\{(e, e') \in W \times R \mid \mathit{wval}(e) = \mathit{rval}(e')\}$,
holds when the first event writes the value that the second reads. For
each memory order $o \in \{\text{RLX}, \text{ACQ}, \text{REL}, \text{AR}, \text{SC}\}$,
we abbreviate the set $\{e \in E \mid \mathit{ord}(e) = o\}$ as just
$o$. We also define the set
$\mathit{nal} = \{e \in E \setminus F \mid \mathit{type}(\mathit{loc}(e)) = \text{non-atomic}\}$
of events that access a non-atomic location.
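
As a minimal sketch of Defs. 4 to 6, the Python fragment below encodes
the labels of the nine events that appear in Example 2 and computes the
derived sets and relations directly from their set-comprehension
definitions. The dictionary layout and the names LBL, LOCATION_TYPE,
same_loc and same_val are assumptions made for illustration; they are
not part of the .cat formalisation.

```python
# Labels of the events in Example 2, as (kind, loc, rval, wval, ord);
# None marks an attribute that the label does not carry.
LBL = {
    "a": ("Wna", "x", None, 0, None),
    "b": ("Wna", "y", None, 0, None),
    "c": ("W", "x", None, 1, "RLX"),
    "d": ("R", "x", 1, None, "RLX"),
    "e": ("R", "x", 2, None, "RLX"),
    "f": ("W", "x", None, 2, "SC"),
    "g": ("R", "y", 0, None, "SC"),
    "h": ("W", "y", None, 1, "SC"),
    "i": ("R", "x", 1, None, "SC"),
}
LOCATION_TYPE = {"x": "atomic", "y": "atomic"}

E = set(LBL)
R = {e for e in E if LBL[e][0] in {"Rna", "R", "RMW"}}
W = {e for e in E if LBL[e][0] in {"Wna", "W", "RMW"}}
F = {e for e in E if LBL[e][0] == "F"}


def loc(e):
    return LBL[e][1]


def rval(e):
    return LBL[e][2]


def wval(e):
    return LBL[e][3]


def order(e):
    return LBL[e][4]


# =loc: non-fence events that access the same location.
same_loc = {(e1, e2) for e1 in E - F for e2 in E - F if loc(e1) == loc(e2)}
# =val: the first event writes the value that the second reads.
same_val = {(w, r) for w in W for r in R if wval(w) == rval(r)}
# The set abbreviated as SC, and the set nal.
SC = {e for e in E if order(e) == "SC"}
nal = {e for e in E - F if LOCATION_TYPE[loc(e)] == "non-atomic"}

print(sorted(SC))
print(sorted(w for (w, r) in same_val if r == "e"))
```

Note that the relation for equal values ignores locations: the initial
write a to x is related to the read g of y because both carry the value
zero, even though the two events are unrelated by the same-location
relation.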

Example 2 (A C11 execution). The diagram below depicts one
execution of the program given in Example 1. The initial events,
a and b, are placed above the events of the four parallel threads.
Reflexive and transitive edges are elided, and derived relations are
not shown.

    a: Wna(x,0)     b: Wna(y,0)

    c: W(x,1,RLX)   d: R(x,1,RLX)   f: W(x,2,SC)   h: W(y,1,SC)
                        | thd           | thd          | thd
                    e: R(x,2,RLX)   g: R(y,0,SC)   i: R(x,1,SC)

Basic executions. The first stage of the C11 semantics translates a
program into a set of executions called its basic set.⁵ Each execution
in this set is compatible with the instructions of the individual
threads, but the set is constructed without considering the behaviour
of shared memory, so it provides an over-approximation of the
executions that will ultimately be allowed to happen once the whole
program and the memory model are taken into account. For instance,
the execution in Example 2 is a basic execution of the program in
Example 1: the values of the write events correspond to the program
text, but the values of the read events are arbitrary and the basic set
of all executions ranges over all choices. We do not define formally
how the basic executions are constructed, and simply assume their
existence for any program we wish to consider. Practical tools such
as Herd and Cppmem implement this construction as part of
litmus test simulation; the construction is investigated formally in
ongoing work by Memarian et al.
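
The over-approximation can be illustrated with a small Python sketch
that ranges over every choice of value for the four loads of Example 1.
It assumes, purely to keep the set finite, that read values are drawn
from the values 0, 1 and 2 that occur in the program; the event names
d, e, g and i follow Example 2, and the names READS, VALUES and
basic_set are hypothetical.

```python
from itertools import product

READS = ["d", "e", "g", "i"]  # the four load events of Example 1
VALUES = [0, 1, 2]            # assumption: only values the program can write

# One basic execution per assignment of a value to every read; the
# write events are fixed by the program text and so are not enumerated.
basic_set = [dict(zip(READS, choice))
             for choice in product(VALUES, repeat=len(READS))]

example_2 = {"d": 1, "e": 2, "g": 0, "i": 1}
print(len(basic_set), example_2 in basic_set)
```

Most of these assignments are later rejected by the memory model; the
basic set only guarantees that nothing allowed has been left out.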

Candidate executions. The second stage of the C11 semantics,
which is the focus of this paper, takes as input a program’s basic
execution set and returns the set of allowed executions. In order to
build the allowed executions, we employ an intermediate structure
called a candidate execution, which extends an execution with a
witness that comprises three additional relations, called rf
(reads-from), mo (modification order) and S (sequential consistency
order).

⁵ This set is sometimes called the ‘pre-executions’ or the ‘opsems’.

Definition 7 (Candidate executions). A candidate execution is
a pair $(X, w)$ where
$X = (E, I, \mathit{lbl}, \mathit{thd}, \mathit{sb})$ is an execution,
and $w = (\mathit{rf}, \mathit{mo}, S)$ is a witness comprising three
relations $\mathit{rf}, \mathit{mo}, S \subseteq E^2$. A candidate
execution is well-formed, written $\mathit{wf}(X, w)$, if:

• the reads-from relation links write events to read events, such
that every read observes exactly one write, and the locations and
values match; that is,

    $\forall e \in R.\ \exists! e' \in W.\ (e', e) \in \mathit{rf}$ and $\mathit{rf} \subseteq (=_{\mathit{loc}} \cap =_{\mathit{val}})$    (WfRf)

where $\exists!$ means ‘exists unique’;

• the modification order relates, in a strict total order, all and only
those events that write to the same atomic location; that is,

    $(\mathit{mo} \cup \mathit{mo}^{-1}) = (=_{\mathit{loc}} \cap W^2 \setminus \mathit{nal}^2 \setminus \mathit{id})$ and $\mathrm{acy}(\mathit{mo})$    (WfMo)

where $\mathrm{acy}(r)$ means that $r$ is acyclic; and

• the S relation relates, in a strict total order, all and only the SC
events in an execution; that is,

    $\mathrm{acy}(S)$ and $(S \cup S^{-1}) = (\text{SC}^2 \setminus \mathit{id})$    (WfS)


Example 3 (A C11 candidate execution). The diagram below
extends the execution in Example 2 with a witness. We elide the
thd edges (each column corresponds to one thread). The candidate
execution is well-formed, and consistent with the axioms of the
memory model (presented next).
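
The well-formedness conditions of Def. 7 can be checked mechanically.
The Python sketch below does so for the events of Example 2. The
witness used here (the relations rf, mo and S at the bottom) is a
hypothetical one chosen for illustration and is not claimed to be the
witness drawn in the diagram; the helper names wf_rf, wf_mo, wf_s and
total_order are likewise assumptions of this sketch.

```python
# Events of Example 2 as id -> (kind, loc, value, memory order).
EVENTS = {
    "a": ("Wna", "x", 0, None), "b": ("Wna", "y", 0, None),
    "c": ("W", "x", 1, "RLX"),  "d": ("R", "x", 1, "RLX"),
    "e": ("R", "x", 2, "RLX"),  "f": ("W", "x", 2, "SC"),
    "g": ("R", "y", 0, "SC"),   "h": ("W", "y", 1, "SC"),
    "i": ("R", "x", 1, "SC"),
}
ATOMIC_LOCATIONS = {"x", "y"}

R = {e for e, lbl in EVENTS.items() if lbl[0] in {"R", "Rna"}}
W = {e for e, lbl in EVENTS.items() if lbl[0] in {"W", "Wna"}}
SC = {e for e, lbl in EVENTS.items() if lbl[3] == "SC"}


def acyclic(rel):
    """True if the transitive closure of rel is irreflexive."""
    closure = set(rel)
    while True:
        extra = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if extra <= closure:
            break
        closure |= extra
    return all(a != b for (a, b) in closure)


def symmetric_closure(rel):
    return rel | {(b, a) for (a, b) in rel}


def wf_rf(rf):
    """(WfRf): each read observes exactly one write; locations and values match."""
    unique = all(len({w for (w, r2) in rf if r2 == r and w in W}) == 1 for r in R)
    matching = all(
        w in W and r in R
        and EVENTS[w][1] == EVENTS[r][1]   # same location
        and EVENTS[w][2] == EVENTS[r][2]   # value written = value read
        for (w, r) in rf)
    return unique and matching


def wf_mo(mo):
    """(WfMo): mo totally orders the writes to each atomic location."""
    expected = {(w1, w2) for w1 in W for w2 in W
                if w1 != w2 and EVENTS[w1][1] == EVENTS[w2][1]
                and EVENTS[w1][1] in ATOMIC_LOCATIONS}
    return symmetric_closure(mo) == expected and acyclic(mo)


def wf_s(s):
    """(WfS): S totally orders exactly the SC events."""
    expected = {(a, b) for a in SC for b in SC if a != b}
    return symmetric_closure(s) == expected and acyclic(s)


def total_order(chain):
    """The strict total order that lists chain from earliest to latest."""
    return {(a, b) for i, a in enumerate(chain) for b in chain[i + 1:]}


# A hypothetical witness for Example 2.
rf = {("c", "d"), ("f", "e"), ("b", "g"), ("c", "i")}
mo = total_order(["a", "c", "f"]) | total_order(["b", "h"])
S = total_order(["f", "g", "h", "i"])
print(wf_rf(rf), wf_mo(mo), wf_s(S))
```

Well-formedness is only the first hurdle: a well-formed candidate must
still satisfy the consistency axioms before it counts as allowed.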



2.3 C11 Axioms

A candidate execution is deemed consistent with the memory model
if it satisfies the 12 consistency axioms of Def. 11, which we shall
build towards in this subsection. We express the axioms using the
.cat language, a concise language based on the propositional
fragment of Tarski’s relation calculus.

Definition 8 (The .cat language). The .cat language supports
the construction of relations via: union, intersection, difference,
complement ($\neg r$), inverse ($r^{-1}$), reflexive closure ($r^?$),
transitive closure ($r^+$), and relational composition ($r_1 ; r_2$),
which is defined such that $(x, z) \in r_1 ; r_2$ if $(x, y) \in r_1$
and $(y, z) \in r_2$ for some $y$. It also provides the syntax
$[s] = \{(e, e) \mid e \in s\}$ for the identity relation (id)
restricted to the set $s$. (These operators can be neatly combined to
describe paths through graphs; for instance,
$[s_1] ; r_1 ; [s_2] ; r_2 ; [s_3]$ relates $s_1$-events to those
$s_3$-events that are reachable by following an $r_1$-edge to an
$s_2$-event and then an $r_2$-edge.) Each axiom of the memory model
must be expressed in the form of an acyclicity (acy r), irreflexivity
(irr r), or emptiness (empty r) constraint on some relation r
constructed using these operators.
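
The operators of Def. 8 translate almost directly into operations on
sets of pairs. The following Python sketch is one such rendering, with
relations represented as sets of tuples; the function names seq,
inverse, ident, opt, plus, irr, acy and empty are choices made here,
not .cat syntax, and complement is omitted because it needs an explicit
universe of events.

```python
def seq(r1, r2):
    """Relational composition r1 ; r2."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}


def inverse(r):
    return {(y, x) for (x, y) in r}


def ident(s):
    """The syntax [s]: the identity relation restricted to the set s."""
    return {(e, e) for e in s}


def opt(r, universe):
    """Reflexive closure r?, taken over an explicit universe of events."""
    return r | ident(universe)


def plus(r):
    """Transitive closure r+."""
    closure = set(r)
    while True:
        extra = seq(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def irr(r):
    return all(x != y for (x, y) in r)


def acy(r):
    return irr(plus(r))


def empty(r):
    return len(r) == 0


# The path expression [s1] ; r1 ; [s2] ; r2 ; [s3] on a toy graph.
r1 = {(1, 2), (1, 3)}
r2 = {(2, 4), (3, 5)}
s1, s2, s3 = {1}, {2}, {4, 5}
path = seq(seq(seq(seq(ident(s1), r1), ident(s2)), r2), ident(s3))
print(path)
```

In the toy graph, the route through node 3 is filtered out by the
intermediate restriction to s2, so only the route through node 2
contributes to the path.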

In order to define these axioms, we first need to introduce several 
derived relations. 

Remark 9. In the following, we justify our formal definitions
by referring to the C11 standard [21], using the notation §N:n
for section N, paragraph n. We refer to the C++11 standard [20]
whenever a clause was erroneously omitted from C11. (C11 inherits
its memory model from C++11.) Similarly, we refer to the C++14
standard [22] in the case of an erroneous omission from C++11.
We include these omitted parts because doing so leads to a cleaner
model that we believe to be closer to the designers’ intent.

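Definition 10 (Further derived sets and relations). In the context of
a candidate execution
$(E, I, \mathit{lbl}, \mathit{thd}, \mathit{sb}, \mathit{rf}, \mathit{mo}, S)$,
we define the following subsets of $E$ and relations over $E$:

\begin{align*}
\mathit{acq} &\stackrel{\text{def}}{=} \text{ACQ} \cup \text{AR} \cup (\text{SC} \cap (R \cup F)) \\
\mathit{rel} &\stackrel{\text{def}}{=} \text{REL} \cup \text{AR} \cup (\text{SC} \cap (W \cup F)) \\
\mathit{fr} &\stackrel{\text{def}}{=} \mathit{rf}^{-1} ; \mathit{mo} \\
\mathit{Fsb} &\stackrel{\text{def}}{=} [F] ; \mathit{sb} \\
\mathit{sbF} &\stackrel{\text{def}}{=} \mathit{sb} ; [F] \\
\mathit{rs}' &\stackrel{\text{def}}{=} \mathit{thd} \cup (E^2 ; [R \cap W]) \\
\mathit{rs} &\stackrel{\text{def}}{=} \mathit{mo} \cap \mathit{rs}' \setminus ((\mathit{mo} \setminus \mathit{rs}') ; \mathit{mo}) \\
\mathit{sw} &\stackrel{\text{def}}{=} ([\mathit{rel}] ; \mathit{Fsb}^? ; [A \cap W] ; \mathit{rs}^? ; \mathit{rf} ; [R \cap A] ; \mathit{sbF}^? ; [\mathit{acq}]) \setminus \mathit{thd} \\
\mathit{hb} &\stackrel{\text{def}}{=} (\mathit{sb} \cup (I \times \neg I) \cup \mathit{sw})^+ \\
\mathit{hbl} &\stackrel{\text{def}}{=} \mathit{hb} \cap =_{\mathit{loc}} \\
\mathit{vis} &\stackrel{\text{def}}{=} (W \times R) \cap \mathit{hbl} \setminus (\mathit{hbl} ; [W] ; \mathit{hb}) \\
\mathit{cnf} &\stackrel{\text{def}}{=} ((W \times W) \cup (W \times R) \cup (R \times W)) \cap =_{\mathit{loc}} \\
\mathit{dr} &\stackrel{\text{def}}{=} \mathit{cnf} \setminus \mathit{hb} \setminus \mathit{hb}^{-1} \setminus A^2 \setminus \mathit{thd}
\end{align*}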


Commentary. The set acq (resp. rel) contains all events that behave
as an acquire (resp. a release).⁶ The from-read relation (fr) links
each read to all those writes that are mo-after the write the read
observed.

The relation rs captures the release sequence, using rs' as a
helper. The release sequence of e comprises those events that form
a maximal mo-chain, starting from e, of events that either are in e’s
thread or are RMWs.⁷

Release/acquire synchronisation is captured by the sw relation.
This relates an atomic write-release event to an atomic read-acquire
event in a different thread if the read obtains its value from the write
or its release sequence.⁸ If the acquire (resp. release) is a fence, the
synchronisation happens via an atomic read (resp. write) sequenced
before (resp. after) the fence.⁹

Happens-before (hb) is a transitive relation that includes
sequenced-before and synchronisation edges, and puts initial events
before all other events.¹⁰ We use hbl to abbreviate happens-before
restricted to events on the same location. A write is visible (vis) to
a read if it is the most recent write to that location in
happens-before.¹¹

Two events are in conflict (cnf) if they access the same location
and at least one is a write;¹² these events go on to form a data
race (dr) if they are unrelated by happens-before, they are not both
atomic, and they are in different threads.¹³
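
To see these derived relations at work, the Python sketch below
evaluates sw, hb and dr on the well-known message-passing idiom, once
with a release write to the flag and once with a relaxed write. The
litmus test, the event names and the function names derived and
message_passing are illustrative assumptions; the sketch also removes
identity pairs from dr, an assumption made here so that no event is
reported as racing with itself.

```python
from itertools import product


def seq(*rels):
    """Relational composition r1 ; r2 ; ... ; rn."""
    out = rels[0]
    for nxt in rels[1:]:
        out = {(a, d) for (a, b) in out for (c, d) in nxt if b == c}
    return out


def ident(s):
    return {(e, e) for e in s}


def inv(r):
    return {(b, a) for (a, b) in r}


def plus(r):
    closure = set(r)
    while True:
        extra = seq(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def derived(events, sb, rf, mo):
    """events: id -> (kind, loc, memory order, thread); thread None = initial."""
    E = set(events)
    kind = {e: events[e][0] for e in E}
    order = {e: events[e][2] for e in E}
    R = {e for e in E if kind[e] in {"R", "Rna", "RMW"}}
    W = {e for e in E if kind[e] in {"W", "Wna", "RMW"}}
    F = {e for e in E if kind[e] == "F"}
    A = {e for e in E if kind[e] in {"R", "W", "RMW"}}
    I = {e for e in E if events[e][3] is None}
    thd = {(a, b) for a in E - I for b in E - I if events[a][3] == events[b][3]}
    same_loc = {(a, b) for a in E - F for b in E - F
                if events[a][1] == events[b][1]}
    acq = {e for e in E
           if order[e] in {"ACQ", "AR"} or (order[e] == "SC" and e in R | F)}
    rel = {e for e in E
           if order[e] in {"REL", "AR"} or (order[e] == "SC" and e in W | F)}

    def opt(r):
        return r | ident(E)  # reflexive closure r?

    Fsb, sbF = seq(ident(F), sb), seq(sb, ident(F))
    rs_helper = thd | seq(set(product(E, E)), ident(R & W))
    rs = (mo & rs_helper) - seq(mo - rs_helper, mo)
    sw = seq(ident(rel), opt(Fsb), ident(A & W), opt(rs), rf,
             ident(R & A), opt(sbF), ident(acq)) - thd
    hb = plus(sb | set(product(I, E - I)) | sw)
    cnf = (set(product(W, W)) | set(product(W, R)) | set(product(R, W))) & same_loc
    # Assumption of this sketch: drop identity pairs from the race relation.
    dr = cnf - hb - inv(hb) - set(product(A, A)) - thd - ident(E)
    return {"sw": sw, "hb": hb, "dr": dr}


def message_passing(flag_write_order):
    events = {
        "i_d": ("Wna", "data", None, None),        # initial write to data
        "i_f": ("Wna", "flag", None, None),        # initial write to flag
        "p": ("Wna", "data", None, 0),             # thread 0: data = 1
        "q": ("W", "flag", flag_write_order, 0),   # thread 0: flag = 1
        "r": ("R", "flag", "ACQ", 1),              # thread 1: reads flag = 1
        "s": ("Rna", "data", None, 1),             # thread 1: reads data = 1
    }
    sb = {("p", "q"), ("r", "s")}
    rf = {("q", "r"), ("p", "s")}
    mo = {("i_f", "q")}
    return derived(events, sb, rf, mo)


with_release = message_passing("REL")
with_relaxed = message_passing("RLX")
print(with_release["sw"], with_release["dr"])
print(sorted(with_relaxed["dr"]))
```

With the release write, the synchronisation edge from q to r orders the
two non-atomic accesses to data by happens-before; with the relaxed
write, that edge disappears and the same two accesses form a data race.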

We now use the derived relations of Def. 10 to formalise what it
means for an execution to be consistent.

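Definition 11 (Consistency). A candidate execution
$(X, w) = (E, I, \mathit{lbl}, \mathit{thd}, \mathit{sb}, \mathit{rf}, \mathit{mo}, S)$
is consistent, written $\mathit{consistent}(X, w)$, if it is
well-formed and it satisfies all of the following axioms:

\begin{align*}
&\mathrm{irr}(\mathit{hb}) && && \text{(Hb)} \\
&\mathrm{irr}((\mathit{rf}^{-1})^? ; \mathit{mo} ; \mathit{rf}^? ; \mathit{hb}) && && \text{(Coh)} \\
&\mathrm{irr}(\mathit{rf} ; \mathit{hb}) && && \text{(Rf)} \\
&\mathrm{empty}((\mathit{rf} ; [\mathit{nal}]) \setminus \mathit{vis}) && && \text{(NaRf)} \\
&\mathrm{irr}(\mathit{rf} \cup (\mathit{mo} ; \mathit{mo} ; \mathit{rf}^{-1}) \cup (\mathit{mo} ; \mathit{rf})) && && \text{(Rmw)} \\
&\mathrm{irr}(S ; r_1) && \text{where } r_1 = \mathit{hb} && \text{(S1)} \\
&\mathrm{irr}(S ; r_2) && \text{where } r_2 = \mathit{Fsb}^? ; \mathit{mo} ; \mathit{sbF}^? && \text{(S2)} \\
&\mathrm{irr}(S ; r_3) && \text{where } r_3 = \mathit{rf}^{-1} ; [\text{SC}] ; \mathit{mo} && \text{(S3)} \\
&\mathrm{irr}((S \setminus (\mathit{mo} ; S)) ; r_4) && \text{where } r_4 = \mathit{rf}^{-1} ; \mathit{hbl} ; [W] && \text{(S4)} \\
&\mathrm{irr}(S ; r_5) && \text{where } r_5 = \mathit{Fsb} ; \mathit{fr} && \text{(S5)} \\
&\mathrm{irr}(S ; r_6) && \text{where } r_6 = \mathit{fr} ; \mathit{sbF} && \text{(S6)} \\
&\mathrm{irr}(S ; r_7) && \text{where } r_7 = \mathit{Fsb} ; \mathit{fr} ; \mathit{sbF} && \text{(S7)}
\end{align*}

⁶ [21 (§7.17.3:3-4)], [21 (§7.17.4.1:2)]
⁷ [21 (§5.1.2.4:10)]
⁸ [21 (§5.1.2.4:11)]
⁹ [21 (§7.17.4:2-4)]
¹⁰ [21 (§5.1.2.4:18)], simplified in the absence of memory_order_consume
¹¹ [21 (§5.1.2.4:19)]
¹² [21 (§5.1.2.4:4)]
¹³ [21 (§5.1.2.4:25)]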


Commentary. These axioms are equivalent to those in Batty et al.’s
Lem formalisation, the fidelity of which has been endorsed
by the C11 standards committee, but because they are expressed
in the .cat language, they are markedly more concise. We have
established this equivalence using the HOL theorem prover, with
the help of a tool we wrote for exporting .cat files to Lem, and our
proof script is available online. We now explain each axiom in
turn.

Happens-before must contain no cycles.¹⁴ Requiring irreflexivity
here is sufficient (Hb), since hb is transitive. Coherence (Coh)
governs the relationship between hb and mo: if the write e1 is
mo-before the write e2, then e2 (and any events that read from e2)
must not happen before e1 (nor before any events that read from
e1).¹⁵ A read must not observe a write that happens after it (Rf),¹⁶
and a read of a non-atomic location must observe a visible write
(NaRf).¹⁷ An RMW must observe the immediately-preceding write
in mo (Rmw);¹⁸ that is, not itself (first disjunct), nor a too-early
write (second disjunct), nor a too-late write (third disjunct).
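
As a worked instance of the coherence axiom, the Python sketch below
evaluates the relation inside (Coh) on a four-event example: one thread
performs two writes to the same location, and another performs two
reads of it. The events w1, w2, r1 and r2, and the function name coh,
are hypothetical, and hb is supplied directly rather than derived.

```python
def seq(*rels):
    """Relational composition r1 ; r2 ; ... ; rn."""
    out = rels[0]
    for nxt in rels[1:]:
        out = {(a, d) for (a, b) in out for (c, d) in nxt if b == c}
    return out


def inv(r):
    return {(b, a) for (a, b) in r}


def opt(r, universe):
    """Reflexive closure r? over an explicit universe of events."""
    return r | {(e, e) for e in universe}


def irr(r):
    return all(a != b for (a, b) in r)


def coh(events, rf, mo, hb):
    """(Coh): irr((rf^-1)? ; mo ; rf? ; hb)."""
    return irr(seq(opt(inv(rf), events), mo, opt(rf, events), hb))


EVENTS = {"w1", "w2", "r1", "r2"}
mo = {("w1", "w2")}                  # w1 is mo-before w2
hb = {("w1", "w2"), ("r1", "r2")}    # program order within each thread

reads_in_order = {("w1", "r1"), ("w2", "r2")}   # r1 sees w1, then r2 sees w2
reads_reversed = {("w2", "r1"), ("w1", "r2")}   # r1 sees w2, then r2 sees w1
print(coh(EVENTS, reads_in_order, mo, hb), coh(EVENTS, reads_reversed, mo, hb))
```

In the reversed case, the second read observes a write that is mo-before
the one its predecessor observed; the composed relation then relates r2
to itself, which is exactly the cycle that (Coh) rules out.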

This leaves the SC axioms, which we present using where-clauses
for ease of reference later. Axiom S1 states that S must
be consistent with happens-before.¹⁹ Axiom S2 governs the
relationship between S and mo: if the write e1 is mo-before the write e2,
then e2 (and any fences sequenced after e2) must not come before
e1 (nor before any fences sequenced before e1) in S.²⁰

Axioms S3 and S4 constrain the values that an SC read e1 of a
location l may observe. If there are any SC writes to l preceding e1
in S, then e1 must read either from the most recent of these in S
(call this e2) or from a non-SC write that does not happen before
e2.²¹ We encode this requirement as two irreflexivity constraints.
First, we wish to rule out reading from an SC write that is not the
most recent in S; that is, we wish to forbid cycles of the shape
depicted below left, where $S_{\mathit{loc}} = S \cap =_{\mathit{loc}}$.
Axiom S3 does this, using the simplified form shown below right.

[Figure: the cycle forbidden by S3 (left) and its simplified form (right).]

Second, we require e1 not to read from a write that happens before
e2; that is, we wish to forbid cycles of the shape depicted below left.
Axiom S4 does this, using the simplified form shown below right.

[Figure: the cycle forbidden by S4 (left), with edges rf, hb and $S_{\mathit{loc}} \setminus (S ; [W] ; S_{\mathit{loc}})$, which simplifies to the form on the right, with edges rf, hbl and $S \setminus (\mathit{mo} ; S)$.]

Axioms S5, S6 and S7 govern SC fences. If a read e1 of a
location l is sequenced after an SC fence, then e1 must not read
from a write to l that is mo-earlier than the last write to l that
precedes the fence in S.²² In fact, ‘the last write’ here can be safely
generalised to ‘some write’, because being mo-earlier than some
write to l that precedes the fence in S implies being mo-earlier than
the last write, since mo is total (S5). If a write e2 to location l is
sequenced before an SC fence, then any SC read of l that follows
the fence in S must not read from a write to l that is mo-earlier than
e2 (S6).²³ Finally, if a read e1 of location l is sequenced after an
SC fence, and a write e2 to l is sequenced before another SC fence
that precedes the first fence in S, then e1 must not read from a write
mo-earlier than e2 (S7).²⁴

¹⁴ [20 (§1.10:12)]
¹⁵ [21 (§5.1.2.4:7)], [21 (§5.1.2.4:22)], [20 (§1.10:17-18)]
¹⁶ The specification uses the ‘visible sequence of side effects’ to
phrase this clause [21 (§5.1.2.4:22)], but Batty [6 (§5.3)] has proved
that ‘happens after’ suffices.
[21 (§7.17.3:12)]
[21 (§7.17.3:6)]
[21 (§7.17.3:6)], [20 (§29.3:7)], [22 (§29.3:7)]
[21 (§7.17.3:9)]

A final axiom formalises what it means for an execution to 
exhibit a fault. 

Definition 12 (Faultiness). A candidate execution (X, w) is faulty,
written faulty(X, w), if it is consistent and does not satisfy the
following axiom:

empty(dr). (Dr)

If any basic execution can be extended to a faulty candidate 
execution, then the entire program’s behaviour is ‘undefined’ and 
any execution is allowed. Otherwise, the allowed executions are 
those basic executions that can be extended to a consistent candidate 
execution. 

Definition 13 (Allowed executions). Given a set Xs of a program’s 
basic executions, we obtain the program’s allowed executions as: 

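\[
\mathit{allowed}(Xs) \;=\; \text{if } \exists X \in Xs.\ \exists w.\ \mathit{faulty}(X, w) \text{ then } \mathbb{X} \text{ else } \{X \in Xs \mid \exists w.\ \mathit{consistent}(X, w)\}
\]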
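
The case split in Definition 13 can be sketched in a few lines of Python. This is only an illustration: the function `witnesses` and the value `any_behaviour` are hypothetical stand-ins (the paper does not name them), and the `consistent` and `faulty` predicates are supplied by the caller rather than implemented here.

```python
def allowed(basic_execs, witnesses, consistent, faulty, any_behaviour):
    """Sketch of Definition 13.

    basic_execs   -- the set Xs of a program's basic executions
    witnesses(X)  -- hypothetical enumerator of candidate witnesses w for X
    consistent    -- predicate consistent(X, w), supplied by the caller
    faulty        -- predicate faulty(X, w), supplied by the caller
    any_behaviour -- stand-in for "every execution is allowed"
    """
    # One faulty candidate execution makes the whole program undefined.
    if any(faulty(X, w) for X in basic_execs for w in witnesses(X)):
        return any_behaviour
    # Otherwise keep the basic executions that have a consistent witness.
    return {X for X in basic_execs
            if any(consistent(X, w) for w in witnesses(X))}
```

Note that faultiness is checked before consistency is used to filter: a single faulty extension anywhere overrides every other execution of the program.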

3. Overhauling the SC Axioms in C11

The rules for SC axioms in C11, as demonstrated in the previous
section, are highly convoluted. In this section, we describe how
these rules can be improved in two fairly orthogonal ways. In §3.1
we describe how the total order over SC operations can be replaced
with a partial order; this simplification will be demonstrated in §6.2
to dramatically improve the efficiency with which the model can
be simulated. In §3.2 we describe a slight strengthening of the
model that enables significant simplifications to be made. These
simplifications lead to a model that is easier to understand, and
should prove easier to work with in a formal setting.

3.1 Reducing S from a Total to a Partial Order 

We observe that all but one of the seven SC axioms (Def. 11) can
be written in the form irr(S ; r) for some relational expression r.
These r’s can be seen as the constraints on the total order S. Axiom
S4 is not quite of this form. However, replacing its ‘S \ (mo ; S)’
with just ‘S’, to obtain the axiom S4a given below, happens to
coincide exactly with an amendment to the model already proposed
by Vafeiadis et al. to lend the model more desirable mathematical
properties [39 (§4.2)].

\[ \mathrm{irr}(S \mathbin{;} r_4) \tag{S4a} \]

Where axiom S4 forbids an SC read to observe any write that 
happens before the most recent SC write in S, axiom S4a forbids it 
to observe any write that happens before any SC write in S. Let us 
assume here that the uncontroversial amendment of Vafeiadis et al. 
will be accommodated by the C standards committee. 

Lemma 14 (SC order extension principle). For any relation r, there
exists a strict total order S over all SC events that is compatible
with r, if and only if r, when restricted to unequal SC events, is acyclic.
That is:

\[ (\exists S.\ \mathit{WfS} \wedge \mathrm{irr}(S \mathbin{;} r)) \;=\; \mathrm{acy}(SC^2 \setminus id \cap r). \]


Proof. This follows from the well-known order extension principle: 
that any (strict) partial order can be extended to a (strict) total 
order. □ 
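
For finite executions, the right-to-left direction of Lemma 14 is constructive: a topological sort of r (restricted to distinct SC events) yields a witness S. The following minimal sketch uses Kahn's algorithm; the function name and the encoding of r as a set of pairs are illustrative assumptions, not part of the paper's formalisation.

```python
from collections import defaultdict, deque

def linear_extension(events, r):
    """Return a list ordering `events` consistently with relation `r`
    (a set of (a, b) pairs meaning "a must come before b"), or None if
    r restricted to distinct events contains a cycle."""
    events = list(events)
    succ = defaultdict(set)
    indeg = {e: 0 for e in events}
    for a, b in r:
        # Restrict r to pairs of distinct events drawn from `events`,
        # mirroring the SC^2 \ id restriction in the lemma.
        if a != b and a in indeg and b in indeg and b not in succ[a]:
            succ[a].add(b)
            indeg[b] += 1
    ready = deque(e for e in events if indeg[e] == 0)
    order = []
    while ready:
        e = ready.popleft()
        order.append(e)
        for f in succ[e]:
            indeg[f] -= 1
            if indeg[f] == 0:
                ready.append(f)
    # Every event is emitted exactly when the restricted relation is acyclic.
    return order if len(order) == len(events) else None
```

A returned list plays the role of S; a result of `None` corresponds to the failure of the acyclicity condition.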

We are now in a position to replace the seven irreflexivity axioms 
with a single acyclicity axiom. 

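Theorem 1. There exists a strict total order on SC events that
satisfies axioms S1, S2, S3, S4a, S5, S6, and S7, if and only if
the following Spartial axiom (which states that the union of all the
constraints on S, when restricted to unequal SC events, is acyclic)
holds:

\[ \mathrm{acy}(SC^2 \setminus id \cap (r_1 \cup r_2 \cup r_3 \cup r_4 \cup r_5 \cup r_6 \cup r_7)) \tag{Spartial} \]

That is:

\[ (\exists S.\ \mathit{WfS} \wedge S1 \wedge S2 \wedge S3 \wedge S4a \wedge S5 \wedge S6 \wedge S7) \;=\; \mathit{Spartial}. \]

Proof.

\begin{align*}
& \exists S.\ \mathit{WfS} \wedge S1 \wedge S2 \wedge S3 \wedge S4a \wedge S5 \wedge S6 \wedge S7 \\
={} & \quad [\text{basic properties of relations}] \\
& \exists S.\ \mathit{WfS} \wedge \mathrm{irr}(S \mathbin{;} (r_1 \cup r_2 \cup r_3 \cup r_4 \cup r_5 \cup r_6 \cup r_7)) \\
={} & \quad [\text{by Lemma 14 with } r \text{ instantiated to } r_1 \cup \dots \cup r_7] \\
& \mathrm{acy}(SC^2 \setminus id \cap (r_1 \cup r_2 \cup r_3 \cup r_4 \cup r_5 \cup r_6 \cup r_7)) \qquad \Box
\end{align*}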

Having replaced axioms S1-S7 with the new Spartial axiom, we
no longer require the S relation in execution witnesses. Memory
model simulators, such as Herd, typically work by enumerating all
executions of a program and then filtering out the consistent subset.
Removing the need to iterate through all possible total orders of
SC events (a computation that is exponential in the number of SC
events) allows simulation performance to be greatly improved, as
demonstrated in §6.2.
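
To make the cost difference concrete, the sketch below contrasts the two checks on a toy relation. It is not how Herd is implemented; both function names are hypothetical. The first function enumerates every candidate total order (n! of them for n SC events), while the second performs a single depth-first search.

```python
from itertools import permutations

def exists_total_order(events, r):
    """Brute force: try every total order S and test irr(S ; r)."""
    evs = list(events)
    known = set(evs)
    pairs = {(a, b) for a, b in r if a != b and a in known and b in known}
    for perm in permutations(evs):
        pos = {e: i for i, e in enumerate(perm)}
        # irr(S ; r) holds exactly when S orders a before b for every
        # (a, b) in r between distinct events.
        if all(pos[a] < pos[b] for a, b in pairs):
            return True
    return False

def is_acyclic(events, r):
    """Single pass: depth-first search for a cycle in r on distinct events."""
    evs = list(events)
    succ = {e: set() for e in evs}
    for a, b in r:
        if a != b and a in succ and b in succ:
            succ[a].add(b)
    state = {e: 0 for e in evs}  # 0 = unvisited, 1 = on stack, 2 = done

    def visit(e):
        state[e] = 1
        for f in succ[e]:
            if state[f] == 1 or (state[f] == 0 and not visit(f)):
                return False  # found a back edge, hence a cycle
        state[e] = 2
        return True

    return all(state[e] != 0 or visit(e) for e in evs)
```

On every input the two functions agree, which is the content of Lemma 14; only the second scales to litmus tests with many SC events.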

3.2 A Stronger and Simpler SC Axiom 

We now show that it is possible to strengthen the SC semantics
without requiring changes to the compilation schemes of any of the
C11 target architectures that have an established formal memory
model, that is: x86 and Power. The strengthening we propose
simplifies the Spartial axiom significantly and provides stronger
guarantees to the programmer.

The proposal for this simplification arises from the observation
that the relations considered in the Spartial axiom are nearly symmetric
in hb, mo and fr. In particular, both hb and mo constrain the
S order between any combination of SC fences and atomics. The
treatment of fr is different: for fr edges that begin or end at a fence,
the axioms S5, S6 and S7 ensure that the SC order is constrained to
match. When two SC atomics are related by an fr edge (S3 and S4),
ordering is only provided when the intermediate access that forms
the fr is itself an SC atomic (rule S3), or when the mo edge from
the intermediate access of the fr to its target is also covered by a hb
edge (rule S4a).

Our proposal is to strengthen the Spartial axiom, to add these
missing constraints so that every fr edge between SC atomics
contributes to the S order. We achieve this in our model by removing
the [SC] restriction from S3, which results in the following axiom:

\[ \mathrm{irr}(S \mathbin{;} fr). \tag{S3a} \]

This change permits a significant simplification to the SC rules that
we establish in the following theorem.

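Theorem 2. If rule S3 is replaced by S3a (that is, if r3 is replaced
with fr in the Spartial axiom) then Spartial becomes equivalent to:

\[ \mathrm{acy}(SC^2 \setminus id \cap (Fsb^? \mathbin{;} (hb \cup fr \cup mo) \mathbin{;} sbF^?)). \tag{Ssimp} \]

That is:

\[ \mathrm{acy}(SC^2 \setminus id \cap (r_1 \cup r_2 \cup fr \cup r_4 \cup r_5 \cup r_6 \cup r_7)) \;=\; \mathit{Ssimp}. \]

Proof.

\begin{align*}
& r_1 \cup r_2 \cup fr \cup r_4 \cup r_5 \cup r_6 \cup r_7 \\
={} & \quad [\text{unfolding definitions and combining } fr, r_5, r_6 \text{ and } r_7] \\
& hb \cup (Fsb^? \mathbin{;} mo \mathbin{;} sbF^?) \cup (Fsb^? \mathbin{;} fr \mathbin{;} sbF^?) \cup r_4 \\
={} & \quad [\text{since } r_4 \subseteq fr, \text{ by WfMo}] \\
& hb \cup (Fsb^? \mathbin{;} mo \mathbin{;} sbF^?) \cup (Fsb^? \mathbin{;} fr \mathbin{;} sbF^?) \\
={} & \quad [\text{since } hb = (Fsb^? \mathbin{;} hb \mathbin{;} sbF^?)] \\
& Fsb^? \mathbin{;} (hb \cup fr \cup mo) \mathbin{;} sbF^? \qquad \Box
\end{align*}

21 [21 (§7.17.3:10)] 22 [21 (§7.17.3:11)]
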

Programming impact. The change presented here does strengthen
the memory model; there are executions that were previously
allowed that are now forbidden. The simplest we found, which
is similar to one used by Vafeiadis et al. [39 (Fig. 6)], is presented in
Example 3. We believe Example 3 to be a counterintuitive execution,
because the read event i does not observe the most recent write to x
in S (namely, f), but c, which is mo-earlier than f. The execution
is forbidden by axiom S3a because of its cycle f → i → f, whose first
edge is in S and whose second edge is in fr.
Although the current C11 model allows this execution, mapping this
example to the formalised targets of C11 (Power and x86) never
yields programs that exhibit it.

3.3 Soundness of Existing C11 Compilation Schemes

There are two C11 targets with formal architectural memory models:
x86 and Power. In this subsection, we establish that for both of 
these architectures, the strengthening does not require a stronger 
compilation mapping. In both cases, we rely on an existing proof 
of soundness from the literature. We need only establish that our 
strengthened Ssimp axiom holds. 

To establish the soundness of our strengthening for x86, we build 
on the soundness proof of Batty et al. (2l, which uses the axiomatic 
model of x86 of Owens et al. (m. To obtain soundness for Power, 
we build on the soundness proof of Batty et al. m , which uses the 
operational Power model of Sarkar et al. (m. 

Theorem 3. Let P be a C11 program that has no faulty executions.
If we compile P to x86 according to the mapping given by Batty
et al., then every valid x86 execution corresponds to a C11
execution where Ssimp holds. If we compile P to Power according to
the mapping given by Batty et al., then every valid Power trace
is observationally equivalent to a C11 execution where Ssimp holds.
[Proof in 

Remark 15 (Soundness of the ARMv8 compilation scheme). At 
the time of writing, work to formalise the ARMv8 specification, and 
how it implements Cll, is ongoing CS). We understand that it is 
not currently clear whether the specification is intended to allow or 
forbid behaviours like our Example 3, and whether the effects of
this decision on the C11 memory model are understood. As such,
we see our work as a timely intervention in the ongoing argument 
about how this particular aspect of the ARMv8 specification should 
evolve and be formalised. 

3.4 Effect on the Standard 

We give below a suggestion for how the wording of the standard 
could be changed to accommodate our proposal. Our text, which 
replaces paragraphs 6 and 9-11 of section 7.17.3, is considerably 
shorter (80 words rather than 276) while preserving the style 
and terminology of the original. We have retained the total order 
S in our wording, because we believe it is more intuitive for 
programmers than an acyclicity condition. Nonetheless, we enable 
efficient simulation of this model via the Ssimp axiom (which is
equivalent to the total order formulation, thanks to Lemma 14 with
r instantiated to Fsb? ; (hb ∪ fr ∪ mo) ; sbF?).


1. A value computation A of an object M reads before a side 
effect B on M if B follows, in the modification order of M, 
the side effect that A observes. 

2. If X reads before Y, or happens before Y, or precedes Y in 
modification order, then X (and any fences sequenced before 
X) is SC-before Y (and any fences sequenced after Y). 

3. There shall be a single total order S on all memory_order_seq_cst
operations, consistent with the SC-before order.
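
Clauses 1 and 2 of the proposed wording translate directly into relational code. The sketch below is an illustration under assumed encodings (relations as Python sets of pairs, mo already transitive); the function names are hypothetical and do not come from the standard or from the .cat model.

```python
def reads_before(rf, mo):
    """Clause 1: A reads before B if B follows, in modification order,
    the side effect that A observes.
    rf -- set of (write, read) pairs; mo -- set of (earlier, later) pairs."""
    return {(read, later)
            for (w, read) in rf
            for (earlier, later) in mo if earlier == w}

def sc_before(hb, mo, fr, sb, fences):
    """Clause 2: X (and any fence sequenced before X) is SC-before
    Y (and any fence sequenced after Y) whenever X reads before,
    happens before, or is mo-before Y."""
    base = hb | mo | fr
    fsb = {(f, x) for (f, x) in sb if f in fences}  # fence sequenced before x
    sbf = {(y, g) for (y, g) in sb if g in fences}  # y sequenced before fence
    out = set(base)
    out |= {(f, y) for (f, x) in fsb for (x2, y) in base if x2 == x}
    out |= {(x, g) for (x, y) in base for (y2, g) in sbf if y2 == y}
    out |= {(f, g) for (f, x) in fsb for (x2, y) in base if x2 == x
            for (y2, g) in sbf if y2 == y}
    return out
```

Clause 3 then asks for a total order on the SC operations that contains the restriction of this relation to SC events, which by Lemma 14 exists exactly when that restriction is acyclic.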


Summary. This section has described how, having strengthened
the original set of axioms (S1 through S7) to use Vafeiadis et
al.’s S4a in place of S4, the behaviour of SC operations can be
captured by a single axiom (Spartial) that allows the total order S to
be eliminated from the model. Moreover, if the axioms are further
strengthened to use our S3a in place of S3, then that axiom can be
greatly simplified (Ssimp), while still respecting current compilation
schemes.

4. Formalising the OpenCL Memory Model 

A principal aim of the OpenCL initiative is to provide functional
portability across a plethora of heterogeneous many-core devices.
The standard is implemented by CPU, GPU and FPGA vendors,
and aims to allow applications to be device-agnostic. The OpenCL
memory model, introduced in the 2.0 revision of the standard,
is inherited from that of C11, but is specialised and extended
for heterogeneous programming. The memory model is the sole
mechanism for correctly implementing fine-grained concurrent
algorithms in a device-agnostic manner. Rigorous foundations for
this model are thus vital.

We now describe how our formalisation of the C11 memory
model (§2, §3) can be extended to yield the first mechanised
formalisation of the full OpenCL memory model. We describe
the form of OpenCL programs (§4.1), their executions (§4.2), and
the axioms against which these executions are judged (§4.3). We
then discuss some interesting features of the memory model: some
innocuous quirks (§4.4) and some serious shortcomings (§4.5). The
most serious shortcoming relates to the axioms that govern SC
atomics, and we propose how to fix this in §5.

For reasons of space, and because they are orthogonal to the
thrust of our contributions, we omit our treatment of barrier synchronisation
operations and the associated issue of barrier divergence.
As with the C11 features omitted earlier, our .cat-based
formalisation of the OpenCL memory model, provided on our companion
webpage, fully accounts for these features.

4.1 OpenCL Programs 

Definition 16 (Structure of OpenCL programs). Building on the
structure of C11 programs, we consider OpenCL programs of the form

\[ P \;=\; \mathop{||||}_{d \in D}\ \mathop{|||}_{w \in W}\ \mathop{||}_{t \in T}\ p_{d,w,t} \]

where D, W, and T are sets of device, work-group, and thread
identifiers, and each p_{d,w,t} is a piece of sequential code.

Using the notation above, we can write p |||| p' to denote a litmus
test comprising two threads to be executed on different devices,
p ||| p' for two threads in different work-groups in the same device,
and p || p' for two threads in the same work-group. We can also write,
for example, p1 || p2 ||| p3 || p4 |||| p5 || p6 ||| p7 || p8, to denote a litmus
test comprising two devices, each executing two work-groups, each
containing two threads.



Remark 17 (Limitations). This program structure does not account
for sub-groups, an optional extension in OpenCL 2.0 that allows
threads to synchronise with one another at a level of granularity finer
than that of a work-group, nor for further non-OpenCL threads
(e.g., POSIX threads) running on the host platform. Moreover, w
and t can actually be 1-, 2-, or 3-dimensional vectors, but we make
the simplifying assumption that all identifiers are natural numbers.

Recall that locations in C11 are either non-atomic or atomic.
OpenCL locations are further declared to reside in a
memory region.

Definition 18 (Memory regions). We have region(l) ∈ {local,
global, global_fgb} for every location l, where fgb stands for
fine-grained shared virtual memory (SVM) buffer. There is one
local region per work-group, containing locations accessible only
to that work-group. Locations in the global or global_fgb region
are accessible to all devices. Fences can be performed either on
the global memories (global and global_fgb) or on the local
memory, or both simultaneously.

The distinction between global and global_fgb locations 
is that the former must not be shared between different devices, 
while the latter enable inter-device communication. Unlike C11,
in which any memory location can be shared between threads, the 
OpenCL memory model physically prevents certain sharing patterns. 
For instance, threads from different devices are forbidden from 
conflicting on global memory, but are able to do so as a result of a 
programmer fault; in contrast, threads from different work-groups 
are unable to conflict on local memory: the language provides no 
mechanism through which such a conflict can arise. 

Definition 19 (Memory scopes). Atomics in OpenCL are parameterised
by a memory scope. The three options are

s ::= WG (work-group scope)
    | DV (device scope)
    | ALL (system scope).

A memory scope specifies how widely visible the effects of the
operation should be.

Example 4. The use of memory scopes is illustrated by the following
code, which implements the message-passing idiom between
two threads in the same work-group.

global int *x; global atomic_int *y;

*x = 42;               ||  if (load(y,ACQ,WG)==1)
store(y,1,REL,WG);     ||    r = *x;

Since all accesses to the global location y come from the same 
work-group, those accesses can be performed at WG scope (which 
means that on implementations where each work-group caches 
global memory, it suffices to read/write those cached values). This 
scope would be insufficient, and the program deemed faulty, if the 
threads were in different work-groups - both scopes would have to 
be upgraded to DV. 

4.2 OpenCL Executions 

OpenCL executions extend C11 executions as follows.

Sub-groups have become a core feature in the recent OpenCL 2.1
specification [23 (p. 22)].
OpenCL also provides private regions, each accessible only to one
thread, and a read-only constant region, but neither of these is
interesting from a memory modelling perspective.

Definition 20 (OpenCL event labels). We extend C11 event labels
with an additional scope attribute, which assigns a memory
scope s to all atomic events. We also subdivide the F label in order
to represent fences on global (F_G), local (F_L) and both-global-and-local
memory (F_GL). The updated table is as follows:
Definition 21 (OpenCL executions). An OpenCL execution is a
tuple (E, I, lbl, thd, wg, dv, sb) where (E, I, lbl, thd, sb) is a C11
execution and wg, dv ⊆ (E \ I)² are equivalence
relations on non-initial events that relate events from the same
work-group and device, respectively. In order to enforce the privacy
of local locations to a single work-group, we require that if
loc(e) = loc(e') = l and region(l) = local, then (e, e') ∈ wg.

Definition 22 (Derived sets and relations). In the context of an
OpenCL execution (E, I, lbl, thd, wg, dv, sb), we define

\begin{align*}
fgb &\overset{\mathrm{def}}{=} \{e \in E \setminus F \mid region(loc(e)) = \texttt{global\_fgb}\} \\
G &\overset{\mathrm{def}}{=} \{e \in F \mid kind(e) \in \{F_G, F_{GL}\}\} \cup \{e \in E \setminus F \mid region(loc(e)) = \texttt{global}\} \cup fgb \\
L &\overset{\mathrm{def}}{=} \{e \in F \mid kind(e) \in \{F_L, F_{GL}\}\} \cup \{e \in E \setminus F \mid region(loc(e)) = \texttt{local}\}
\end{align*}

as the sets of events that access, respectively: fine-grained atomic
SVM buffers, global memory, and local memory. Also, for each
scope s, we abbreviate the set {e ∈ A | scope(e) = s} as just s.

Definition 23 (OpenCL candidate executions). Candidate executions
in OpenCL, and their well-formedness, are defined in the same
way as in C11.


4.3 OpenCL Axioms 

We now define and discuss the consistent and faulty predicates
for the OpenCL memory model, paying particular attention to each
of the departures from C11. We justify our formal definitions by
reference to the OpenCL specification [23], writing n/m to denote
line m on page n.

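Definition 24 (Further derived sets and relations). In the context of
a candidate execution (E, I, lbl, thd, sb, rf, mo, S), we define the
following subsets of E and relations over E:

\begin{align*}
incl &\overset{\mathrm{def}}{=} (WG^2 \cap wg) \cup (DV^2 \cap dv) \cup ALL^2 \\
rsw(r) &\overset{\mathrm{def}}{=} ([r \cap rel] \mathbin{;} Fsb^? \mathbin{;} [W \cap A] \mathbin{;} rs^? \mathbin{;} [r] \mathbin{;} rf \mathbin{;} [R \cap A] \mathbin{;} sbF^? \mathbin{;} [r \cap acq]) \cap incl \setminus thd \\
gsw &\overset{\mathrm{def}}{=} rsw(G) \cup (rsw(L) \cap (SC^2 \cup (G \cap L \cap F)^2)) \\
lsw &\overset{\mathrm{def}}{=} rsw(L) \cup (rsw(G) \cap (SC^2 \cup (G \cap L \cap F)^2)) \\
ghb &\overset{\mathrm{def}}{=} (G^2 \cap (sb \cup (I \times \neg I)) \cup gsw)^+ \\
lhb &\overset{\mathrm{def}}{=} (L^2 \cap (sb \cup (I \times \neg I)) \cup lsw)^+ \\
ghbl &\overset{\mathrm{def}}{=} ghb \cap {=_{loc}} \\
lhbl &\overset{\mathrm{def}}{=} lhb \cap {=_{loc}} \\
gvis &\overset{\mathrm{def}}{=} (W \times R) \cap ghbl \setminus (ghbl \mathbin{;} [W] \mathbin{;} ghb) \\
lvis &\overset{\mathrm{def}}{=} (W \times R) \cap lhbl \setminus (lhbl \mathbin{;} [W] \mathbin{;} lhb) \\
hr &\overset{\mathrm{def}}{=} cnf \setminus (ghb \cup lhb) \setminus (ghb \cup lhb)^{-1} \setminus incl \setminus thd \\
iddr &\overset{\mathrm{def}}{=} cnf \setminus dv \setminus fgb^2 \\
\text{sc-all} &\overset{\mathrm{def}}{=} \neg(E^2 \mathbin{;} [SC \setminus (ALL \cap fgb)] \mathbin{;} E^2) \\
\text{sc-dv} &\overset{\mathrm{def}}{=} \neg(E^2 \mathbin{;} [SC \setminus (DV \setminus fgb)] \mathbin{;} E^2)
\end{align*}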







Commentary. In OpenCL, only events that have inclusive scopes
(incl) can synchronise: either the events have WG scope and are in
the same work-group, or they have DV scope and are in the same
device, or they have ALL scope. We shall explain in §4.5 how this
notion of scope inclusion is unnecessarily conservative.

The synchronisation relation (rsw) is parameterised by a region
r (global or local). The global synchronises-with relation (gsw) includes
events that synchronise on global memory, but also includes
events that synchronise on local memory, providing both events have
memory order SC or both are global-and-local fences. Local
synchronises-with (lsw) is analogous. Example 6 shows how synchronisation
works in the presence of global-and-local fences.

Happens-before is partitioned into global and local versions:
global happens-before (ghb) contains global synchronises-with and
sequenced-before edges between events on global memory, and
local happens-before (lhb) is analogous. See Example 5 for a
discussion of the repercussions of this definition of happens-before.
Visibility is also split into global (gvis) and local (lvis) versions.
The heterogeneous race (hr) generalises C11’s data race (dr,
Def. 10), to reflect the fact that in OpenCL, even atomic operations
can race when memory scopes are used incorrectly. If two events
from different devices conflict on a location that is not in a fine-grained
atomic SVM buffer, then they form an inter-device data race
(iddr); such races cannot be ruled out by happens-before edges.
This leaves the sc-all and sc-dv relations. In OpenCL, the total
order S is only required to exist when

\[ SC \subseteq ALL \cap fgb \quad\text{or}\quad SC \subseteq DV \setminus fgb. \]

The first condition holds when every SC event has ALL scope and
accesses a global_fgb location; the second holds when every SC
event has DV scope and does not access a global_fgb location.
The relation sc-all (resp. sc-dv) is the universal relation if the first
(resp. second) condition holds and is the empty relation otherwise. In
§4.5 we shall criticise these conditions as being simultaneously too
strong for programmers and too weak for compiler-writers.


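Definition 25 (Consistency axioms in OpenCL). There are nine
consistency axioms. Departures from the C11 consistency axioms
(Def. 11) are highlighted.

\begin{align*}
& \mathrm{irr}(ghb) && \text{(O-HbG)} \\
& \mathrm{irr}(lhb) && \text{(O-HbL)} \\
& \mathrm{irr}((rf^{-1})^? \mathbin{;} mo \mathbin{;} rf^? \mathbin{;} ghb) && \text{(O-CohG)} \\
& \mathrm{irr}((rf^{-1})^? \mathbin{;} mo \mathbin{;} rf^? \mathbin{;} lhb) && \text{(O-CohL)} \\
& \mathrm{irr}(rf \mathbin{;} (ghb \cup lhb)) && \text{(O-Rf)} \\
& \mathrm{empty}((rf \mathbin{;} [G \cap nal]) \setminus gvis) && \text{(O-NaRfG)} \\
& \mathrm{empty}((rf \mathbin{;} [L \cap nal]) \setminus lvis) && \text{(O-NaRfL)} \\
& \mathrm{irr}(rf \cup (mo \mathbin{;} mo \mathbin{;} rf^{-1}) \cup (mo \mathbin{;} rf)) && \text{(O-Rmw)} \\
& \mathrm{acy}(SC^2 \setminus id \cap (\text{sc-all} \cup \text{sc-dv}) \cap (Fsb^? \mathbin{;} (ghb \cup lhb \cup fr \cup mo) \mathbin{;} sbF^?)) && \text{(O-Ssimp)}
\end{align*}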


Commentary. Both happens-before relations are required to be
acyclic (O-HbG, O-HbL). OpenCL requires coherence for both
global and local happens-before separately (O-CohG, O-CohL).
The axioms governing the reads-from relation are carried over
from C11 (O-Rf, O-NaRfG, O-NaRfL, O-Rmw), but appropriately
divided into global and local versions.

OpenCL defines the same SC axioms that we saw in Def. 11
(S1-S7), but uses ghb ∪ lhb in place of hb. We have incorporated
into axiom O-Ssimp the simplifications that we already discussed
in the context of C11 (§3). Intersecting with the sc-all and sc-dv
conditions means that the acyclicity constraint is only enforced when
one of those conditions holds.

25 [23 (47/16-26)] 26 [23 (51/1-9)] 27 [23 (51/32-33)]
28 [23 (54/13-16)] 29 [23 (49/3-7)] 30 [23 (49/8-11)]
31 [23 (49/21-26)] 32 This terminology is due to Hower et al. [18].
33 [23 (49/29-33)] 34 [23 (58/24-27)] 35 [23 (51/15-17)]
36 [23 (51/18-20)] 37 [23 (49/12-13)] 38 [23 (50/11-24)]


Definition 26 (Faultiness in OpenCL). A candidate OpenCL execution
is faulty if it is consistent and does not satisfy both of the
following axioms:

empty(hr) (O-Hr)

empty(iddr) (O-Iddr)

4.4 Quirks in the Memory Model 

We present three worked examples that illustrate features of the
memory model that may not be obvious from a cursory glance at
its axioms. These ‘quirks’ in the model are distinguished from the
technical shortcomings that we save for §4.5.

Our first example illustrates an interesting consequence of 
OpenCL’s separation of happens-before into two distinct relations. 

Example 5. Suppose the code of Example 4 were changed so that y
were declared local rather than global. Executions such as the one
below would then become consistent, which means that a stale value
of x can be read (event f), even when successful release/acquire
synchronisation (between d and e) has occurred.



This execution is consistent because the sb edges no longer induce
either variety of happens-before, since they link events that act
on different memory regions. Worse still, there is now a data race
between c and f, which renders the entire program undefined.


We learn from Example 5 that a flag in one memory region
cannot be used to protect data in another region. To address this
issue, OpenCL provides fences that act on both global and local
memory simultaneously. These are illustrated in Example 6.


Example 6. The following program uses relaxed (RLX) accesses
on the local flag y, relying instead on the fences to synchronise the
threads and enable the global data x to be passed.

global int *x; local atomic_int *y;

*x = 42;               ||  if (load(y,RLX,WG)==1)
fence(GL,REL,WG);      ||  { fence(GL,ACQ,WG);
store(y,1,RLX,WG);     ||    r = *x; }


The fence instructions successfully prevent the stale value of x 
being read, because the following execution is inconsistent. 


a: Wna(x, 0)            b: Wna(y, 0)
c: Wna(x, 1)            f: R(y, 1, RLX, WG)
d: F_GL(REL, WG)        g: F_GL(ACQ, WG)
e: W(y, 1, RLX, WG)     h: Rna(x, 0)

[23 (49/26-27)], [23 (50/8-9)], [23 (52/22-23)]; [23 (51/14)]








The execution is inconsistent because it has a cycle from h to a
(via the inverse of rf), from a to c (via mo), and from c back to h
(via ghb), in violation of O-CohG. Note that the ghb edge from c to h
holds here because, firstly, (d, g) is in rsw(L) and hence in gsw and ghb, and
secondly, (c, d) and (g, h) are both in sb ∩ G² and hence in ghb.


In Example 7, we illuminate the relationship between memory
scopes and non-atomic operations. Since scopes can be used to
limit atomic operations to certain groups of threads, it is tempting
to introduce an additional ‘work-item’ scope, WI, and encode non-atomic
events as atomic events whose scope is limited to the current
thread. This would make the Wna and Rna labels redundant. An
ordinary data race can then be cast as a failure of scope inclusion.
However, the differences between non-atomic and atomic operations
go beyond racy behaviours, as we shall see in the following example.



4.5 Problems with the Memory Model 

We present three shortcomings in the OpenCL memory model, 
which we discovered as a direct result of our formalisation efforts. 

Scope inclusion is too strong. The specification provides an 
overly conservative notion of scope inclusion: two events only have 
inclusive scopes if their scopes match exactly. This leads to such 
surprises as the following example. 

Example 8. Suppose the code of Example 4 were changed so that
the store to y now occurs at DV scope, but the load of y remains at
WG scope. Although the release scope is clearly ‘wide enough’, it
does not match the acquiring scope, so no synchronisation edge is
induced. This leads to two data races: both between the non-atomic
accesses of x, and between the ill-scoped atomic accesses of y.

A resolution proposed by Gaster et al. is to allow the annotated
scopes to differ, as long as both are sufficiently wide [17 (§3.1)].
This enables, for instance, a DV-scoped write to synchronise with
a WG-scoped read in the same work-group. The proposal can be
formalised in our framework by changing the definition of the incl
relation (Def. 24) as follows:

\begin{align*}
incl_1 &\overset{\mathrm{def}}{=} ([WG] \mathbin{;} wg) \cup ([DV] \mathbin{;} dv) \cup ([ALL] \mathbin{;} E^2) \\
\text{new-incl} &\overset{\mathrm{def}}{=} incl_1 \cap incl_1^{-1}
\end{align*}

The idea here is to define a one-sided version of scope inclusion
first, so that (e1, e2) is in incl1 if e1 has a wide enough scope to
‘reach’ e2. Requiring this to hold in both directions ensures that both
events have sufficient scopes, if not necessarily the same.
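
The difference between the two notions of scope inclusion can be checked on the events of Example 8 with a small sketch. The encoding below is an assumption made for illustration: events are dictionary keys, `scope` maps each event to "WG", "DV" or "ALL", and `wg` and `dv` map each event to its work-group and device identifiers (with work-group identifiers unique across devices).

```python
def incl(scope, wg, dv, e1, e2):
    """Original inclusion: scopes must match exactly and be wide enough."""
    s1, s2 = scope[e1], scope[e2]
    if s1 != s2:
        return False
    if s1 == "WG":
        return wg[e1] == wg[e2]
    if s1 == "DV":
        return dv[e1] == dv[e2]
    return s1 == "ALL"

def incl1(scope, wg, dv, e1, e2):
    """One-sided inclusion: e1's scope is wide enough to reach e2."""
    s = scope[e1]
    if s == "WG":
        return wg[e1] == wg[e2]
    if s == "DV":
        return dv[e1] == dv[e2]
    return s == "ALL"

def new_incl(scope, wg, dv, e1, e2):
    """Proposed inclusion: each event's scope reaches the other."""
    return incl1(scope, wg, dv, e1, e2) and incl1(scope, wg, dv, e2, e1)

# Example 8: a DV-scoped store and a WG-scoped load in one work-group.
scope = {"store_y": "DV", "load_y": "WG"}
wg = {"store_y": (0, 0), "load_y": (0, 0)}
dv = {"store_y": 0, "load_y": 0}
same_group_old = incl(scope, wg, dv, "store_y", "load_y")
same_group_new = new_incl(scope, wg, dv, "store_y", "load_y")
```

Under this encoding the original relation rejects the pair while the proposed one accepts it; moving the load to a different work-group makes both reject it, since WG scope then no longer reaches the store.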

The SC axioms are too weak. As encoded in our O-Ssimp axiom
(Def. 25), SC operations in OpenCL are only guaranteed to provide
SC behaviour when one of the sc-all and sc-dv conditions holds.


Since these are conditions on the whole program, we have a “clear 
composability problem” jm. 

We find several reasons why these conditions are problematic.
First, they mean that the default memory scope (which is
DV) is not sufficient to ensure SC semantics in all situations. Second,
any program that includes a WG-scoped SC atomic, such as
store(x,1,SC,WG), immediately violates the conditions. Third,
the conditions are mutually exclusive, so a program that satisfies
sc-all can be combined with another that satisfies sc-dv, with the
result satisfying neither. Finally, consider the following example.


Example 9. The following program, comprising two threads in
different work-groups on the same device, has SC semantics, which
means that it cannot exhibit the relaxed behaviour r0 = r1 = 0:

global atomic_int *x, *y;

a: store(x,1);         |||  c: store(y,1);
b: r0 = load(y);       |||  d: r1 = load(x);

Note that the atomic store and load operations default to the SC
memory order and the DV memory scope, and that condition sc-dv
holds. However, if global is changed to global_fgb, the relaxed
behaviour becomes permissible, because neither condition sc-all
nor sc-dv holds. Condition sc-dv no longer holds now that x and y
are in fine-grained atomic SVM buffers, and condition sc-all does
not hold either because the ALL scope is not being used.
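
The whole-program nature of the two conditions is easy to see when they are written as predicates over the set of SC events. The sketch below is an illustration only: it assumes each SC event is summarised as a hypothetical (scope, region) pair and that every SC event is a memory access.

```python
def sc_conditions(sc_events):
    """Return (sc_all_holds, sc_dv_holds) for a program's SC events.

    sc_events -- iterable of (scope, region) pairs, one per SC event,
                 e.g. ("DV", "global"); this encoding is an assumption.
    """
    evs = list(sc_events)
    # sc-all: every SC event has ALL scope and accesses global_fgb.
    sc_all = all(s == "ALL" and r == "global_fgb" for s, r in evs)
    # sc-dv: every SC event has DV scope and avoids global_fgb.
    sc_dv = all(s == "DV" and r != "global_fgb" for s, r in evs)
    return sc_all, sc_dv

# Example 9: four SC accesses at the default DV scope.
example9 = [("DV", "global")] * 4
example9_fgb = [("DV", "global_fgb")] * 4
```

With this encoding, `sc_conditions(example9)` reports that only sc-dv holds, and `sc_conditions(example9_fgb)` reports that neither holds, matching the discussion above. Concatenating a program that satisfies sc-all with one that satisfies sc-dv likewise yields a program that satisfies neither.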


It is jarring that such a small change, from global to global_fgb,
can legitimise relaxed behaviours. Worse still, such a change
may be invisible to the programmer, if they can see only the kernel
code: the assignment of locations to SVM buffers occurs only on the
host side, and such locations are only marked in a kernel as global.

The SC axioms are too strong. Following discussion with members
of the Khronos OpenCL working group, we understand that the
purpose of condition sc-dv is to enable efficient implementations of 
DV-scoped SC atomics. The intention of the condition is that if no 
SC atomic accesses memory shared between devices, they can be 
implemented without expensive inter-device synchronisation. It was 
thought not to matter that the specification requires implementations 
to establish a total order between SC events on different devices, 
because it is not possible to observe this order without creating an 
inter-device data race. 

In fact, this is not the case. We present in Example 10 a program
that satisfies condition sc-dv, and yet is still able to observe the 
order between SC events in different devices - even though these 
events are DV-scoped and access no memory shared between devices. 


Example 10. Consider the following program, which comprises
two devices, both executing two threads (stacked vertically). It can
be thought of as a ‘twisted’ version of the store-buffering test.

global atomic_int *x, *y;
global_fgb atomic_int *z1, *z2;

store(x,1,SC,DV);            ||||  store(y,1,SC,DV);
store(z1,1,REL,ALL);         ||||  store(z2,1,REL,ALL);

r1 = load(z2,ACQ,ALL)?       ||||  r2 = load(z1,ACQ,ALL)?
       load(x,SC,DV) : 1;    ||||         load(y,SC,DV) : 1;

Two threads in different devices write, using DV scope, to distinct
global locations x and y, and then write to global_fgb flags,
using ALL scope, to signal that they are done. Meanwhile, two
partner threads try to acquire these signals from the opposite device,
and if they are successful, they read the location their partner (in
the same device as they) wrote to. We are interested in whether
these reads can both obtain 0; that is, whether the final state
{r1 = r2 = 0} is allowed. This final state could only be obtained
via the following execution:


















where the outer dotted rectangles delimit dv equivalence classes and 
the inner ones delimit thd equivalence classes. 

The execution is inconsistent, and therefore must be forbidden
by a compiler. To see this, observe that each rf edge induces a
synchronisation (gsw) edge, and hence global happens-before. Since
sb edges also contribute to global happens-before, we obtain a
cycle of ghb and fr edges through the four SC events.
This makes the execution fall foul of O-Ssimp, which is non-vacuous
here because the condition sc-dv is satisfied.


That the execution in the example above is not allowed implies 
that OpenCL implementations must make the order of SC write 
operations visible to all devices, even when those writes are only 
performed with DV scope. In other words, the current phrasing of the 
OpenCL memory model demands too much from the compiler- 
writer to permit an efficient implementation of DV-scoped SC 
atomics, while in other respects offering too little to the programmer, 
by guaranteeing SC semantics only when an onerous condition 
holds. 

To summarise: the intent of the Khronos working group was 
to enable efficient implementation of DV-scoped SC atomics by 
compilers, at the expense of programmer inconvenience. Instead, 
our formalisation shows that we have the worst of both worlds: the 
programmer is inconvenienced, and yet a correct compiler is obliged 
to enforce inter-device orderings on DV-scoped SC atomics. 


5. Overhauling the SC Axioms in OpenCL 

We describe how the handling of SC atomics in OpenCL can be 
changed to address the shortcomings identified in §4.5.

Building on a suggestion by Gaster et al. (§7.2 of their paper), we propose
to eradicate the stringent conditions on the existence of the SC order
by simply intersecting the constraints on the SC order with the
scope-inclusion relation. This essentially means that the orderings
imposed between events by the SC axioms only take effect if those
events have inclusive scopes. Under this proposal, which recalls the
way C11’s synchronisation relation (sw, Def. 10) is intersected with
scope-inclusion when producing OpenCL’s version (rsw, Def. 24),
we do not need to restrict the programmer’s usage of SC atomics to 
certain scopes; instead, the guarantees provided by those SC atomics 
degrade gracefully as their scopes narrow. 

Definition 27 (Proposed SC axiom for OpenCL). The following
axiom for SC atomics in OpenCL is obtained from O-Ssimp by
removing the sc-all and sc-dv conditions and instead intersecting
with incl:

\[ \mathrm{acy}\bigl(\mathrm{SC}^2 \cap (\mathit{Fsb}^? \,;\, (\mathit{ghb} \cup \mathit{lhb} \cup \mathit{fr} \cup \mathit{mo}) \,;\, \mathit{sbF}^?) \cap \mathit{incl}\bigr) \tag{O-Sscoped} \]
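The effect of intersecting with incl can be illustrated with a minimal sketch, assuming relations are represented as Python sets of event pairs. The helper names (`compose`, `opt`, `acyclic`, `o_s_scoped`) and the four-event store-buffering shape are hypothetical and chosen only for illustration; the incl relation is taken as an input rather than derived from scope annotations.

```python
def compose(r, s):
    """Relational composition r ; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}

def opt(r, events):
    """Reflexive closure r? over the given events."""
    return r | {(e, e) for e in events}

def acyclic(r):
    """True iff the transitive closure of r relates no event to itself."""
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return all(a != b for (a, b) in closure)
        closure |= extra

def o_s_scoped(events, sc, fsb, sbf, ghb, lhb, fr, mo, incl):
    """acy(SC^2 & (Fsb? ; (ghb | lhb | fr | mo) ; sbF?) & incl)."""
    sc2 = {(a, b) for a in sc for b in sc}
    middle = ghb | lhb | fr | mo
    chain = compose(compose(opt(fsb, events), middle), opt(sbf, events))
    return acyclic(sc2 & chain & incl)

# Hypothetical store-buffering shape: thread 0 = (Wx, Ry), thread 1 = (Wy, Rx).
events = {"Wx", "Ry", "Wy", "Rx"}
thread = {"Wx": 0, "Ry": 0, "Wy": 1, "Rx": 1}
sb = {("Wx", "Ry"), ("Wy", "Rx")}
fr = {("Ry", "Wy"), ("Rx", "Wx")}   # both loads read the initial value
args = dict(events=events, sc=events, fsb=set(), sbf=set(),
            ghb=sb, lhb=set(), fr=fr, mo=set())

# Assumption: every pair of events has inclusive scopes.
incl_all = {(a, b) for a in events for b in events}
# Assumption: scopes are inclusive only within a thread.
incl_thread = {(a, b) for (a, b) in incl_all if thread[a] == thread[b]}

consistent_wide = o_s_scoped(incl=incl_all, **args)
consistent_narrow = o_s_scoped(incl=incl_thread, **args)
print(consistent_wide, consistent_narrow)
```

With inclusive scopes the cycle through the two fr edges violates the axiom; once the cross-thread pairs fall outside incl, the same execution satisfies it, which is the graceful degradation described above.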


5.1 Effect on the Standard 

To accommodate our proposal, we propose that the wording of
the OpenCL 2.1 standard [23, 51/14-31 and 51/34-52/13] be
changed to match the text given in §3.4, but with ‘happens before’
replaced with ‘global or local happens before’, and ‘consistent with
the SC-before order’ replaced with ‘consistent with the SC-before
order restricted to operations with inclusive scopes’. This replaces
391 words with 89 words, while retaining the standard’s style and
terminology.

     OpenCL atomic operation       Assembly instructions

  1  r = load(x,SC,WG)             LD r x
  2  r = load(x,SC,DV)             INV_L1 ; LD r x ; INV_L1
  3  store(x,r,SC,WG)              ST r x
  4  store(x,r,SC,DV)              FLU_L1 ; ST r x ; FLU_L1
  5  r = fetch_inc(x,SC,WG)        INC_L1 r x
  6  r = fetch_inc(x,SC,DV)        FLU_L1 ; INC_L2 r x ; INV_L1

Table 1. Compiling the revised OpenCL memory model

5.2 Implementability of the New SC Axiom 

The new O-Sscoped axiom is stronger than the original O-Ssimp
axiom, so we must confirm that our proposal does not place undue 
demands on compilers that implement the memory model. 

The only published compilation scheme of the OpenCL 2.0 memory
model of which we are aware is that published by AMD [30] and
later formalised by Wickerson et al. The scheme compiles the
release/acquire fragment of OpenCL atomics, and its soundness has
been verified against an operational model of an AMD GPU.
In this subsection we describe how the scheme can be extended to 
support SC atomics, and we demonstrate via a series of examples 
that the extended scheme meets the requirements of our revised SC 
axiom. The original compilation scheme does not cater for multiple 
devices, and does not include fences, and we do not attempt here to 
extend the scheme to cover these features. As such, this scheme does 
not engage directly with the problems of inter-device SC atomics 
that we noted in the previous section; however, it does illustrate how 
WG- and DV-scoped SC atomics can co-exist. 

The AMD compilation scheme. The operational model is quite
simple. Each work-group has its own L1 cache, and each device has
its own L2 cache. Since the compilation scheme considers only the
single-device case, the L2 cache can be safely thought of as the main
memory. No instruction reordering is permitted. At any time, the
environment can flush a dirty L1 cache entry to the L2 (and thereby
make it clean), can fetch an L2 entry to replace a clean L1 entry, and
can evict a clean L1 entry.

The semantics of the various assembly instructions can be
summarised as follows. LD r x loads into register r from the nearest
cache that contains a valid entry for x; ST r x stores from r into
the local L1 cache, first flushing x’s entry therein if it is invalid;
INC_L1 r x increments x in the local L1 cache; INC_L2 r x increments
x in the L2 cache, first flushing any dirty entry for x in the local
L1 cache; FLU_L1 flushes all dirty entries in the local L1 cache; and
INV_L1 marks all entries in the local L1 cache as invalid.

The extensions to the compilation scheme are given in Tab. 1.
Here, fetch_inc stands for ‘atomic fetch and increment’, and 
provides a representative of RMW operations in OpenCL. 

Correctness of the compilation scheme. Most of the flush and 
invalidate instructions in the compilation scheme are necessary 
to ensure correct release/acquire semantics. For SC atomics, we 
need add only two further instructions: the INV_L1 before the load
in row 2, and the FLU_L1 after the store in row 4. The need for
these instructions can be motivated by considering the following 
two examples, which correspond to the classic store-buffering and 
IRIW litmus tests. 

The memory model requires the program in Example 9 not to
produce the final state r0 = r1 = 0. With only release/acquire
semantics, the compilation scheme inserts no flush or invalidate
instructions between the store and the load in each thread, and
the relaxed behaviour can be observed: both threads might prefetch
x = y = 0 into their respective L1 caches (the threads are
in different work-groups, so they have different L1 caches), then
perform their stores, and finally load the L1-cached values of x and
y. However, placing a FLU_L1 after the store and an INV_L1 before the
load ensures that no sequence of fetching and flushing can lead to
the relaxed behaviour. We do not need FLU_L1 or INV_L1 instructions
before or after the SC increment instruction, because INC_L2 writes
directly to the L2, invalidating the L1 as it does so.
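The store-buffering argument above can be checked on a toy version of the cache model. The following is a minimal sketch under stated assumptions, not the formal operational model: an invalid entry is represented as an absent one, every L1 cache starts empty, the environment may fetch into an empty slot as well as over a clean entry, INV writes dirty entries back before dropping them, and the increment instructions are omitted. All function names are hypothetical.

```python
from collections import deque

X, Y = 0, 1  # indices of the two locations

def put(tup, i, v):
    """Return a copy of tuple `tup` with position i replaced by v."""
    return tup[:i] + (v,) + tup[i + 1:]

def flush_all(l2, l1):
    """Write every dirty L1 entry back to the L2 and mark it clean."""
    for i, e in enumerate(l1):
        if e is not None and e[1]:
            l2, l1 = put(l2, i, e[0]), put(l1, i, (e[0], False))
    return l2, l1

def step(instr, l2, l1):
    """Execute one instruction; return (l2, l1, loaded value or None)."""
    op = instr[0]
    if op == "FLU":
        return flush_all(l2, l1) + (None,)
    if op == "INV":  # assumption: dirty entries are written back before being dropped
        l2, l1 = flush_all(l2, l1)
        return l2, (None,) * len(l1), None
    if op == "ST":   # entry is (value, dirty)
        return l2, put(l1, instr[1], (instr[2], True)), None
    if op == "LD":   # nearest cache holding the location: L1 if present, else L2
        e = l1[instr[1]]
        return l2, l1, (e[0] if e is not None else l2[instr[1]])
    raise ValueError(op)

def outcomes(programs):
    """Explore all interleavings of thread and environment steps."""
    n = len(programs)
    init = ((0, 0), ((None, None),) * n, (0,) * n, (None,) * n)
    seen, todo, finals = {init}, deque([init]), set()
    while todo:
        l2, l1s, pcs, regs = todo.popleft()
        if all(pcs[t] == len(programs[t]) for t in range(n)):
            finals.add(regs)
        succs = []
        for t in range(n):
            if pcs[t] < len(programs[t]):   # thread t executes its next instruction
                nl2, nl1, val = step(programs[t][pcs[t]], l2, l1s[t])
                nregs = regs if val is None else put(regs, t, val)
                succs.append((nl2, put(l1s, t, nl1), put(pcs, t, pcs[t] + 1), nregs))
            for i, e in enumerate(l1s[t]):  # environment acts on thread t's L1
                if e is not None and e[1]:  # flush a dirty entry
                    clean = put(l1s[t], i, (e[0], False))
                    succs.append((put(l2, i, e[0]), put(l1s, t, clean), pcs, regs))
                else:                       # fetch from the L2, or evict a clean entry
                    fetched = put(l1s[t], i, (l2[i], False))
                    succs.append((l2, put(l1s, t, fetched), pcs, regs))
                    succs.append((l2, put(l1s, t, put(l1s[t], i, None)), pcs, regs))
        for s in succs:
            if s not in seen:
                seen.add(s)
                todo.append(s)
    return finals

def sb(thread_code):
    """Store-buffering: each thread stores to one location and loads the other."""
    return [thread_code(X, Y), thread_code(Y, X)]

# Release store followed by acquire load: nothing between ST and LD.
rel_acq = sb(lambda a, b: [("FLU",), ("ST", a, 1), ("LD", b), ("INV",)])
# Rows 4 and 2 of Table 1: FLU after the store, INV before the load.
sc_dv = sb(lambda a, b: [("FLU",), ("ST", a, 1), ("FLU",), ("INV",), ("LD", b), ("INV",)])

relaxed_with_rel_acq = (0, 0) in outcomes(rel_acq)
relaxed_with_sc = (0, 0) in outcomes(sc_dv)
print(relaxed_with_rel_acq, relaxed_with_sc)
```

In this sketch the state space is small enough for a breadth-first search to cover every interleaving of thread and environment steps, so the absence of (0, 0) in the second result is exhaustive for the toy model, though it says nothing about behaviours the simplifications leave out.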

The memory model also requires the IRIW litmus test

global atomic_int *x; global atomic_int *y;

store(x,1);  ||  store(y,1);  ||  r0=load(x);  ||  r2=load(y);
                                  r1=load(y);      r3=load(x);

not to produce the final state {r0 = r2 = 1, r1 = r3 = 0}. (Recall
that these store and load operations use memory order SC and
scope DV by default.) Here, an INV_L1 instruction between each pair
of loads is sufficient to rule out such executions.


6. Simulating the Memory Models with Herd 

Our overhaul of SC atomics avoids the requirement for the S relation 
to be explicitly constructed in execution witnesses. Our hypothesis 
was that this would lead to improved efficiency in the process of 
exhaustively enumerating the allowed behaviour of litmus tests 
that use SC atomics. We now explain how we extended the Herd 
memory model simulator in order to enable investigation of C11
and OpenCL litmus tests (§6.1), and present experimental results
using Herd to compare the efficiency of simulation before and after
our overhaul, and also in comparison to the CDSChecker memory
model simulator [29] (§6.2). For a family of litmus tests derived from
Dekker’s algorithm, our results show that our revised axioms lead
to an exponential speedup in simulation time using Herd, bringing
performance using Herd, which is general-purpose and exhaustive
on loop-free programs, much closer to that of CDSChecker, which is
specifically tuned for the C11 memory model and is not guaranteed
to be exhaustive, even on loop-free programs. 

6.1 Extensions to Herd 

The version of Herd described by Alglave et al. supports
only assembly code: sequences of labelled instructions and gotos.
In order to simulate our formalisations of the C11 and OpenCL
memory models, we have extended the .cat format to support the
definition of faulty axioms, and the Herd tool with both a routine
for alerting the user when a faulty execution is detected and a module
for translating C11 and OpenCL programs into their executions.

We model only a small fragment of the C11 language: enough to
encode the litmus tests we found useful for testing our formalisation.
We exclude, for example, the address-of operator, compound types,
and function calls. We include if and while blocks, pointer
dereferencing, simple expressions, and built-in atomic functions such as
atomic_thread_fence (C11) and atomic_work_item_fence
(OpenCL).

6.2 Simulating the C11 Model: Performance Evaluation

We now compare the performance of Herd in enumerating the
behaviours of litmus tests (a) when equipped with the original SC
axioms in C11 vs. (b) when equipped with our revised SC axioms.
We also provide performance results gathered using CDSChecker, a
custom-built simulator for the C11 memory model [29]. The Herd
tool guarantees exhaustive enumeration of allowed behaviours for
a loop-free litmus test; CDSChecker aims for high coverage of
behaviours, but is known to be non-exhaustive in general [29].

Note on the IRIW test of §5.2: on weaker models, such as Power, that
are not multi-copy atomic, further synchronisation would be required
between the loads.

Figure 1. Time to simulate an N-threaded store-buffering test

Recall that Dekker’s mutual exclusion algorithm is a key
use case for SC atomics. The essential idiom underlying an N-threaded
version of Dekker’s algorithm is captured by the following
N-threaded store-buffering litmus tests:

    P_N  def=  store(x_2,1);   ||  store(x_3,1);   ||  ...  ||  store(x_1,1);
               r_1=load(x_1);  ||  r_2=load(x_2);  ||  ...  ||  r_N=load(x_N);

that operate on a collection {x_1,...,x_N} of atomic locations
initialised to zero. Recall that atomic store and load operations
use memory order SC by default. Dekker’s algorithm requires that it
is not possible to observe the final state where r_1 = ... = r_N = 0;
only SC is strong enough to rule out this relaxed behaviour.
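As a sanity check on this family, the sketch below enumerates every sequentially consistent interleaving of P_N and collects the final register values. It is an illustrative brute-force interpreter, not Herd's algorithm; the function names are hypothetical, and threads and locations are indexed from 0, so thread t stores to location (t + 1) mod N and then loads location t.

```python
def put(tup, i, v):
    """Return a copy of tuple `tup` with position i replaced by v."""
    return tup[:i] + (v,) + tup[i + 1:]

def sc_final_states(n):
    """Final register valuations of P_n over all SC interleavings."""
    init = ((0,) * n, (0,) * n, (None,) * n)  # program counters, memory, registers
    seen, stack, finals = {init}, [init], set()
    while stack:
        pcs, mem, regs = stack.pop()
        if all(pc == 2 for pc in pcs):
            finals.add(regs)
            continue
        for t in range(n):
            if pcs[t] == 0:
                # Thread t stores 1 to its right-hand neighbour's location.
                nxt = (put(pcs, t, 1), put(mem, (t + 1) % n, 1), regs)
            elif pcs[t] == 1:
                # Thread t then loads its own location.
                nxt = (put(pcs, t, 2), mem, put(regs, t, mem[t]))
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return finals

for n in range(2, 6):
    finals = sc_final_states(n)
    print(n, len(finals), (0,) * n in finals)
```

Under these assumptions the all-zero outcome never appears, and every other combination of register values does.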

We use the family (P_N)_{N∈ℕ} to assess the scalability of the
two versions of Herd and of CDSChecker. Figure 1 shows the
time each tool takes to simulate P_N as N increases. Experiments
were conducted on a 3.1 GHz MacBook Pro, and each data point
represents the mean of ten runs. We do not include error bars because
the standard deviation is negligible. The original memory model,
naively implemented in Herd, times out on just 4 threads. This is
because it iterates over all (2N)! orders of the 2N SC events that are
in every execution of P_N. When Herd is provided with our revised
memory model, simulation times greatly improve. Bearing in mind
the logarithmic y-axis, the performance of both Herd on the revised
memory model and CDSChecker appears to scale exponentially
with N, which meets expectations since P_N has 2^N - 1 unique final
states. Still, CDSChecker significantly outperforms Herd when
simulating P_N, and on several other programs that we tried. This is
because CDSChecker, unlike Herd, is optimised specifically for
the C11 memory model, through the use of such techniques as the
early elimination of infeasible executions, and a variant of dynamic
partial order reduction (DPOR) on the S order. In fact, we
conjecture that the use of DPOR here has an effect similar to our
proposal to rephrase the memory model with S as a partial order.
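To give a feel for the two growth rates mentioned in this paragraph, the short sketch below tabulates the number of total orders over the 2N SC events against the number of final states. It only evaluates the two formulas quoted in the text; the function names are hypothetical, and it makes no claim about actual running times.

```python
import math

def total_orders(n):
    """Number of total orders over the 2n SC events of P_n: (2n)!."""
    return math.factorial(2 * n)

def final_state_count(n):
    """Number of final states of P_n quoted in the text: 2^n - 1."""
    return 2 ** n - 1

for n in range(1, 7):
    print(n, total_orders(n), final_state_count(n))
```

Already at N = 4 each execution admits 8! = 40320 candidate total orders, against only 15 distinct final states.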

Figure 1 demonstrates that simply by tweaking the axioms that
define the memory model, simulation time can be dramatically
decreased, without the need to implement complex optimisations,
such as DPOR, that make it difficult to assess the soundness and
completeness of the tool. It happens that CDSChecker is exhaustive
on all of our P_N programs, but we remark that we can only be sure
of this because of Herd.

7. Related Work 

The C11 memory model has been formalised several times. Batty
et al. present a comprehensive formalisation using Lem [28].
Vafeiadis et al. and Batty et al. [9] have also formalised
slightly simplified variations. Alglave et al. have formalised a
release/acquire fragment of the C11 model (without release sequences,
fences, non-atomics, or data races) in the .cat language, and have
shown it to be an instance of their generic axiomatic memory
model. We use the .cat language in our work too, but our
comprehensive model, which incorporates undefined behaviours
and a richer language of events, no longer fits within their generic
framework.

Note on the experiments of §6.2: we used Herd revision 88ffl89
(http://github.com/herd/herdtools) and CDSChecker revision 7c51087
(git://demsky.eecs.uci.edu/model-checker.git).

We remark that in the absence of fences, our Ssimp axiom (see 
Theorem]^ forbids the same dependency cycles that Shasha and 
Snir characterise as violations of sequential consistency. In a
sense, one contribution of our paper is to simplify the semantics of
C11’s SC atomics to the point where it can be defined, for the first
time, in the Shasha-Snir style. 

Criticisms of the C11 model. Batty et al. describe a fundamental
problem in the structure of the C11, C++11, C++14 and OpenCL
memory models: the so-called “thin-air” executions [10]. This is a
difficult open problem requiring a radically different approach; we
do not address it here.

Vafeiadis et al. note that the current rules governing SC atomics
break desirable properties of the memory model, harming the
prospect of reasoning above it, and they propose a strengthening
of the model to fix this [39]. Our proposal builds on theirs,
but goes further (§3.2), arriving at a substantially simpler model.
A similar proposal was in fact considered by Vafeiadis et al. in the
context of the original total-order SC axioms [39], but abandoned
over concerns that it would invalidate the existing Power compilation
scheme. In our work, we have demonstrated that such a proposal is
in fact valid on Power (and x86).

We note that despite our strengthening, SC fences remain too
weak to restore sequential consistency in all circumstances, even
when placed between every pair of accesses. This weakness was
intentional in C11 to permit efficient implementation over Intel’s
Itanium architecture [6], but it does harm programmability. Lahav et
al. [25] have proposed an alternative implementation of SC fences,
in terms of acquire/release RMWs on a distinguished location, that
always restores sequential consistency.

The OpenCL 2.0 memory model has recently been described by
Gaster et al., as an instance of a heterogeneous race-free (HRF)
model. Our work improves on theirs in several ways. A key
shortcoming of their work is its relative informality: it lacks the
mathematical precision that is required to resolve all the details
of the OpenCL memory model. Our formalisation, in contrast, is
precise enough to be executed by a machine. Moreover,
their characterisation of the OpenCL memory model has several
technical issues. It replaces the specification’s modification order
(which orders atomic write events) with a coherence order (which
orders both read and write events) without proving that the intent of
the specification is preserved by this change. Another infidelity to the
specification is the omission of release sequences, which prohibits
the correct treatment of release-fences. Indeed, Gaster et al. include
no formal treatment of fences at all, describing their behaviour only
in prose. Our .cat presentation of the OpenCL memory model
treats release sequences and fences in full. Its informality aside,
Gaster et al.’s work contains numerous insights into the design and
workings of the OpenCL memory model, and provided a valuable
basis for our formalisation efforts.

We have already begun to build on top of the formalisation of the 
OpenCL memory model presented here, as part of our investigations 
into the semantics of a proposed extension to OpenCL called
remote-scope promotion. That work, which has already been published,
describes only a small ‘release/acquire’ fragment of the OpenCL 
memory model, while the current paper describes the full model, 
including the interesting and important SC and relaxed atomics. 


Implementations of the OpenCL memory model. AMD and Intel
have recently released OpenCL 2.0-compliant implementations.
We are aware only of one implementation of the OpenCL memory
model that has been formalised: namely, a compilation scheme
from OpenCL (extended with a feature called remote-scope
promotion) to a model of next-generation AMD GPUs [41]. Alglave
et al. present an experimentally-validated axiomatic model of an
Nvidia GPU [3], which could provide another compilation target for
our OpenCL memory model. However, we find that their model is
too weak to admit an efficient mapping from OpenCL. Specifically,
it does not provide the property of cumulativity: synchronisation at
one scope cannot be chained with further synchronisation at a wider
scope to induce overall synchronisation between the two end-points.
Since cumulativity is a property required by the OpenCL memory
model, we deduce that the OpenCL compiler must, very expensively,
treat all operations as having the widest scope.

Memory model simulators other than Herd that are capable of
handling the C11 model include Cppmem [7], Nitpick [13] and
CDSChecker [29]. We did not include Cppmem and Nitpick in
our tool comparison (§6.2) because Norris et al. have already
demonstrated that CDSChecker’s performance is far superior [29].

Because it is highly optimised for the C11 memory model,
CDSChecker continues to outperform Herd even on the revised 
model. Herd on the other hand is deliberately designed not to 
be optimised for a particular model, but to be instead a generic 
memory model simulator. A key advantage of using a generic 
memory model simulator like Herd is that it is easy to tinker 
with the model during the development process: one must only 
modify a text file and restart Herd in order to explore the impact 
of a proposed change. Indeed, this ease of modification, together 
with the challenge of expressing the C11 model in the very concise
.cat language, inspired our discovery of the simpler SC axioms
described in this paper. Moreover, where CDSChecker is designed 
for efficiency, sometimes at the cost of fidelity to the memory model 
(the lack of self-satisfying conditionals, for instance, is a source of 
incompleteness in CDSChecker), our formalisation and simulator 
are designed primarily to represent the memory model as closely as 
possible. 

CDSChecker obtains its main performance benefits by exploring
partial modification orders. It is therefore natural to ask whether the
memory model could be revised to accommodate partial modification
orders in the same way that we have incorporated a partial S
order. We believe that this is not straightforwardly possible without
changing the model: our partial order reduction on S hinges on its
constraints all having the form irr(S ; r) for some r, but this is not
the case for mo (see axiom Rmw, for instance).

8. Conclusion 

Our overhaul of the semantics of SC atomics and fences provides
four main benefits in relation to the C11 and OpenCL memory
models: more efficient exploration of the behaviours of litmus tests;
refined specification text that we argue is easier
for programmers and compiler-writers to understand (cf. §3.4);
improved usability of the languages by programmers;
and opportunities for compiler-writers to produce more efficient
implementations. We argue that all of the formalised C11 and OpenCL
compilation schemes of which we are aware remain valid under our
proposed changes to the memory models.

A topic for future research is the consideration of memory 
consistency between OpenCL devices and the host application that 
launches kernels on these devices; our treatment in this paper focuses 
solely on interactions between kernel threads. We also plan to use 
our memory model as a basis for reasoning about OpenCL programs, 
extending the capabilities of tools such as GPUVerify, where
existing support for atomic operations is limited and not based on
formal foundations [5].

Acknowledgements 

Luc Maranget kindly advised on our extensions to Herd. We thank 
Jade Alglave, Nathan Chong, Benedict Gaster, Vinod Grover, Lee 
Howes, Jeroen Ketema, Matthew Parkinson, Peter Sewell, Tyler 
Sorensen, and our anonymous reviewers for their feedback and 
encouragement. This work was supported by the EPSRC (grants 
EP/K011499/1, EP/I020357/1, EP/K015168/1, and EP/I01236/1),
and by the EU FP7 project CARP (project number 287767). 

A. Proof of Theorem 3

The following theorem states that the x86 and Power compilation
schemes for C11, as given in Tab. 2, remain sound in the presence
of our revised SC axiom, Ssimp.

     C11 operation       x86             Power

  1  r = load(x,SC)      lock xadd(0)    sync; ld; cmp; bc; isync
  2  store(x,r,SC)       lock xchg       sync; st
  3  fence(SC)           mfence          sync

Table 2. Compiling the C11 SC atomics

Theorem 3 (repeated from §3.3). Let P be a C11 program that
has no faulty executions. If we compile P to x86 according to the
mapping given by Batty et al., then every valid x86 execution
corresponds to a C11 execution where Ssimp holds. If we compile
P to Power according to the mapping given by Batty et al.,
then every valid Power trace is observationally equivalent to a C11
execution where Ssimp holds.

Proof (x86 case). The axiomatic model of Owens et al. restricts the
behaviour of memory using a partial order over x86 memory events
called memory-order. The proof of Batty et al. [7] constructs the
relations of the C11 execution using memory-order; modification
order (mo) and reads-from (rf), in particular, are projected from it.
Here we rely on several properties of memory-order as set out by
Owens et al.: when restricted to writes, it is a linear order; program
order between two events is included in memory-order if there is an
intervening fence or if either instruction is locked; program-order
edges from reads to later events are included; and a read observes
the most recent preceding write in memory-order.

We proceed by contradiction, showing that given the construction
of rf and mo used in the proof of Batty et al., any cycle in
$\mathrm{SC}^2 \setminus \mathit{id} \cap (\mathit{Fsb}^? ; (\mathit{hb} \cup \mathit{fr} \cup \mathit{mo}) ; \mathit{sbF}^?)$ implies either the existence
of a cycle in memory-order, or an inconsistent rf edge.

Any cycle in the relation is made up of mo, hb and fr edges,
possibly linked with sb edges. The mo, hb and sb edges all imply
corresponding memory-order edges. To see this, note:

• memory-order is a linear order over writes;

• any hb edge in the Ssimp relation begins with either a fence or a
locked instruction, so sb edges correspond to memory-order
edges, then there may be a chain of edges in mo ; rf ; sb, where
the final sb edge is headed by a read, so transitivity implies that
hb corresponds to memory-order; and

• any sb edges are either between locked instructions or have a
fence between accesses, and so correspond to memory-order
edges.

Finally, if the cycle contains an fr edge, then memory-order cannot
contradict this: the read would become inconsistent in the x86
execution. Then for a given Ssimp cycle, we have a sequence of
memory-order edges that would form a cycle if not for holes
corresponding to fr edges.

We now show that there is indeed a cycle in memory-order.
Consider an fr edge: its head is a read, so the preceding edge must
either be a sb edge or a hb edge. If it is a sb edge then either the
head of that edge is a write, or the edge that precedes that is an hb
edge. In all cases, the read at the head of the fr edge is preceded in
hb by a write. As memory-order is total over writes, it must order
this preceding write’s x86 counterpart before the write in the tail of
the fr edge. We use this fact to construct a cycle in memory-order,
a contradiction. □

Proof (Power case). In their proof, Batty et al. construct a C11
execution from a Power trace such that Power coherence and
rf edges match their constructed C11 counterparts. In proving
that the SC axioms hold over the execution, they prove a property,
good-sc, that establishes a total order over the SC atomics of
the execution that contains po, co, fr and an extended variant
of reads from, erf, each restricted to the SC atomics. The SC
fences are added to this relation in a way that is consistent with
co and rf in the rest of the execution, preserving the invariant
that it is a strict partial order. The following edges become part of
the SC order: $[\mathrm{SC}] ; \mathit{po}^? ; (\mathit{rf}^{-1})^? ; \mathit{co} ; \mathit{rf}^? ; \mathit{po}^? ; [F \cap \mathrm{SC}]$ and
$[F \cap \mathrm{SC}] ; \mathit{po}^? ; (\mathit{rf}^{-1})^? ; \mathit{co} ; \mathit{rf}^? ; \mathit{po}^? ; [\mathrm{SC}]$. The proof goes on
to show that hb restricted to the SC actions is a subset of the total
order. The construction of mo and rf in the C11 execution follows
the Power trace directly, so good-sc together with the addition of
the fence edges, which contain $\mathit{Fsb}^? ; \mathit{fr} ; \mathit{sbF}^?$ and $\mathit{Fsb}^? ; \mathit{co} ; \mathit{sbF}^?$,
shows the acyclicity of $\mathrm{SC}^2 \setminus \mathit{id} \cap (\mathit{Fsb}^? ; (\mathit{hb} \cup \mathit{fr} \cup \mathit{mo}) ; \mathit{sbF}^?)$
directly. □